=Paper= {{Paper |id=Vol-3602/paper3 |storemode=property |title=A (Dis)similarity Index for Comparing Two Character Networks Based on the Same Story |pdfUrl=https://ceur-ws.org/Vol-3602/paper3.pdf |volume=Vol-3602 |authors=François Bavaud,Coline Métrailler |dblpUrl=https://dblp.org/rec/conf/comhum/BavaudM22 }} ==A (Dis)similarity Index for Comparing Two Character Networks Based on the Same Story== https://ceur-ws.org/Vol-3602/paper3.pdf
                                A (Dis)similarity Index for Comparing Two Character
                                Networks Based on the Same Story
                                François Bavaud1,2 , Coline Métrailler1
                                1
                                  Faculty of Arts, Department of Language and Information Sciences, University of Lausanne, bâtiment Anthropole,
                                1015 Lausanne, Switzerland
                                2
                                  Faculty of Geosciences and Environment, Institute of Geography and Sustainability, University of Lausanne, bâtiment
                                Géopolis, 1015 Lausanne, Switzerland


                                                                         Abstract
                                                                         Comparing networks is always a complicated matter, whose effective implementation strongly depends
                                                                         on the amount of shared information between them, in particular whether nodes, edges, weights etc. are
                                                                         identical, or not. In the case of character networks and adaptations (from book to movie, from movie
                                                                         to theater, and so on), the formal challenge proves stimulating: some characters will be mapped from
                                                                         one work to the other, some will have no correspondence, and their weights, measuring their relative
                                                                         occurrence, are bound to differ.
                                                                             This formal contribution, rooted in Multivariate Data Analysis, proposes a presumably novel similarity
                                                                         index, the generalized weighted RV coefficient, taking into account both the difference in character
                                                                         weights (nodes) and in character interactions (edges). This approach first requires transforming the
                                                                         character networks into weighted squared Euclidean configurations. We then compare a novel by C. S.
                                                                         Lewis, part of the series The Chronicles of Narnia, and the script of its film adaptation to illustrate the
                                                                         proposal and the results.




                                1. Introduction
                                Networks of fictional characters often exist in two or more versions. For instance (section 4),
                                network 𝐴 is built from a novel, and network 𝐵 from a movie adaptation. Besides the main
                                characters common to both versions, there are characters proper to a single version only. Also,
                                the importance of common characters (as measured, e.g., by their relative occurrence), is bound
                                to vary between the two versions, as is the strength of their mutual relations (as e.g. measured
                                by their relative co-occurrence). Hence, the networks 𝐴 and 𝐵 differ both along character
                                weights (nodes) and interaction weights (edges).
                                   Our contribution proposes the definition of a single index measuring the overall similarity
                                between 𝐴 and 𝐵. This index, denoted RV, constitutes an innovative generalization, involving two
                                distinct sets of object weights, of the weighted RV-coefficient [1], which is itself a generalization of
                                the original, unweighted RV-coefficient [2] (where R referred to "correlation" and V to "vector").
                                In particular, RV ∈ [0, 1], with RV = 1 iff 𝐴 and 𝐵 are identical (i.e., same character weights and
                                dissimilarities between characters), and RV = 0 iff 𝐴 and 𝐵 have no character in common. This

                                COMHUM 2022: Workshop on Computational Methods in the Humanities, June 09–10, 2022, Lausanne, Switzerland
                                Email: fbavaud@unil.ch (F. Bavaud); coline.metrailler@unil.ch (C. Métrailler)
                                ORCID: 0000-0002-4565-0715 (F. Bavaud); 0000-0002-3196-481X (C. Métrailler)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073



François Bavaud et al. CEUR Workshop Proceedings                                             33–42


similarity coefficient can in turn be transformed into a dissimilarity coefficient, which can be
additively decomposed into five components.
  The formalism, set out in section 2, is rooted in Weighted Data Analysis. It involves three
major steps:
   1. Transforming a weighted network into a weighted Euclidean configuration (section 2.1).
      This step is chiefly dictated by the formalism, which requires squared Euclidean dissimilarities,
      but permits as a byproduct a visualization of the character network (by weighted
      multidimensional scaling), of interest in itself.
   2. Transforming the weighted Euclidean configuration into a kernel, whose eigen-decomposition
      makes it possible to visualize the network nodes (section 2.2).
   3. Computing the generalized RV coefficient (beginning of section 3), which assesses the similarity
      between two networks whose node weights may differ, and its exact decomposition into
      five terms (sections 3.1 and 3.2).


2. Visualization of character networks: a few "reminders"
We formalize character networks as weighted networks $(\mathbf{f}, \mathbf{C})$, where $\mathbf{f}$ is the vector of the
$n$ character weights, obeying $f_i \ge 0$ and $\sum_{i=1}^{n} f_i = 1$. The $n \times n$ matrix of edge weights
$\mathbf{C} = (c_{ij})$ is non-negative, and quantifies the importance of edge $ij$, reflecting some kind of
affinity between the characters $i$ and $j$, such as their co-occurrence in the present study
(section 4). It can be symmetric (as for C = A, where A is a binary adjacency matrix in
a non-directed network), or not (as for asymmetric social relationships), in which case the
network is directed.

2.1. Extracting Euclidean dissimilarities from a weighted network
Network visualization is a boundless research topic, even restricted as done here to the Euclidean
embedding of networks. Specifically, one seeks to extract from the network data (f , C) a
matrix D = (𝐷𝑖𝑗 ) of squared Euclidean dissimilarities between nodes, that is of the form
$D_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|^2$, where $\mathbf{x}_i$ is the representative vector of node $i$. Only two proposals, among
many other possibilities (extracting squared Euclidean dissimilarities from a weighted network
is a topic in itself), are considered in this contribution: commute-time distances (section 2.1.1)
and diffusive distances (section 2.1.2).

2.1.1. Commute-time distances
By construction, the square matrix $\mathbf{P} = (p_{ij})$ with components $p_{ij} = c_{ij}/c_{i\bullet}$ (where
$c_{i\bullet} = \sum_{j=1}^{n} c_{ij}$) is non-negative, with $p_{i\bullet} = 1$: it therefore constitutes the transition matrix of a
Markov chain, defining a random walk on the network. The commute time $D^{\mathrm{com}}_{ij}$ is the average
time needed to go from $i$ to $j$ and then back to $i$ [see e.g. 3]. The matrix $\mathbf{D}^{\mathrm{com}} = (D^{\mathrm{com}}_{ij})$
of commute times is well known to be squared Euclidean [see e.g. 4], irrespective of the
properties of $\mathbf{C}$, as long as $\mathbf{C}$ is irreducible, that is, as long as any two states can be directly or
indirectly connected by the random walk.
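
The construction above can be sketched numerically. The following minimal Python sketch is not the authors' implementation: it assumes a symmetric, connected edge-weight matrix and relies on the classical identity $D^{\mathrm{com}}_{ij} = c_{\bullet\bullet}\,(L^{+}_{ii} + L^{+}_{jj} - 2L^{+}_{ij})$, where $\mathbf{L}^{+}$ is the Moore-Penrose pseudo-inverse of the combinatorial Laplacian $\mathrm{diag}(\mathbf{C}\mathbf{1}_n) - \mathbf{C}$. The function names and the toy matrix are illustrative assumptions.

```python
import numpy as np

def transition_matrix(C):
    """Row-normalize the edge weights: p_ij = c_ij / c_i. (rows sum to 1)."""
    C = np.asarray(C, dtype=float)
    return C / C.sum(axis=1, keepdims=True)

def commute_time_distances(C):
    """Commute times of the random walk on a symmetric, connected network.

    Assumption (not from the paper): C is symmetric, so that the
    pseudo-inverse identity for commute times applies.
    """
    C = np.asarray(C, dtype=float)
    laplacian = np.diag(C.sum(axis=1)) - C      # combinatorial Laplacian
    L_plus = np.linalg.pinv(laplacian)          # Moore-Penrose pseudo-inverse
    d = np.diag(L_plus)
    return C.sum() * (d[:, None] + d[None, :] - 2.0 * L_plus)

# Toy path network 0 - 1 - 2 with unit edge weights (illustrative data).
C_toy = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
P_toy = transition_matrix(C_toy)
D_com = commute_time_distances(C_toy)
```

On this toy path, the walk needs on average 4 steps to go from one end node to the other and 4 steps to come back, so the commute time between the two end nodes is 8.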





2.1.2. Diffusive distances
One considers the edge weights $\mathbf{C}$ as generating a so-called instantaneous jump process, permitting
navigation from node $i$ to node $j$ with a rate given by (minus) the components of the
weighted Laplacian $\mathbf{L} = \Pi^{-1/2}\,[\mathrm{diag}(\mathbf{C}\mathbf{1}_n) - \mathbf{C}]\,\Pi^{-1/2}$, where $\Pi = \mathrm{diag}(\mathbf{f})$, and $\mathbf{C}$ must be taken
as symmetric, that is, replaced if necessary by $\frac{1}{2}(\mathbf{C} + \mathbf{C}^\top)$. Then choose a diffusion time $t > 0$,
and compute the joint probability $e_{ij}(t)$ to be initially in $i$ and in $j$ at time $t$ (or the other way
round) as

$$\mathbf{E}(t) = (e_{ij}(t)) = \Pi^{1/2} \exp(-t\,\mathbf{L})\, \Pi^{1/2}$$

The squared Euclidean diffusive distances $\mathbf{D}^{\mathrm{diff}}(t) = (D^{\mathrm{diff}}_{ij}(t))$ are finally obtained as [see e.g. 5]

$$D^{\mathrm{diff}}_{ij}(t) = \frac{e_{ii}(t)}{f_i^2} + \frac{e_{jj}(t)}{f_j^2} - 2\,\frac{e_{ij}(t)}{f_i f_j}\,.$$
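
As an illustration of the two formulas above, here is a minimal Python sketch, assuming strictly positive weights `f` and computing the matrix exponential through the spectral decomposition of the symmetric Laplacian. The function name `diffusive_distances`, the toy network and the value `t = 1.0` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def diffusive_distances(C, f, t):
    """Return E(t) and the squared Euclidean diffusive distances D(t).

    Assumes f_i > 0 for all nodes (division by sqrt(f_i) and f_i).
    """
    C = np.asarray(C, dtype=float)
    f = np.asarray(f, dtype=float)
    C = 0.5 * (C + C.T)                          # symmetrize if necessary
    sqrt_f = np.sqrt(f)
    # Weighted Laplacian L = Pi^{-1/2} [diag(C 1) - C] Pi^{-1/2}
    L = (np.diag(C.sum(axis=1)) - C) / np.outer(sqrt_f, sqrt_f)
    # exp(-t L) from the eigen-decomposition of the symmetric matrix L
    eigval, eigvec = np.linalg.eigh(L)
    heat = (eigvec * np.exp(-t * eigval)) @ eigvec.T
    E = sqrt_f[:, None] * heat * sqrt_f[None, :]
    G = E / np.outer(f, f)                       # e_ij / (f_i f_j)
    g = np.diag(G)                               # e_ii / f_i^2
    D = g[:, None] + g[None, :] - 2.0 * G
    return E, D

# Toy path network with weights proportional to the degrees (illustrative).
C_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
f_toy = C_toy.sum(axis=1) / C_toy.sum()
E_toy, D_diff = diffusive_distances(C_toy, f_toy, t=1.0)
```

Since $\sqrt{\mathbf{f}}$ lies in the null space of $\mathbf{L}$, the rows of $\mathbf{E}(t)$ sum to $\mathbf{f}$ for any $t$, which gives a convenient sanity check on such a computation.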

2.2. Visualizing a weighted Euclidean configuration by weighted MDS
 Weighted multidimensional scaling constitutes the canonical procedure for the low-dimensional
visualisation of a weighted configuration (f , D) (see figure 1):
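   1. define $\Pi = \mathrm{diag}(\mathbf{f})$, as well as the weighted centering matrix $\mathbf{H} = \mathbf{I}_n - \mathbf{1}_n \mathbf{f}^\top$
   2. obtain by double centering the matrix of scalar products $\mathbf{B} = -\frac{1}{2}\,\mathbf{H}\,\mathbf{D}\,\mathbf{H}^\top$
   3. define the matrix of weighted scalar products, or kernel, $\mathbf{K}$ as:

$$\mathbf{K} = \sqrt{\Pi}\,\mathbf{B}\,\sqrt{\Pi} \qquad\qquad K_{ij} = \sqrt{f_i f_j}\; B_{ij} \tag{1}$$

   4. perform the spectral decomposition $\mathbf{K} = \mathbf{U}\Lambda\mathbf{U}^\top$ where $\mathbf{U} = (u_{i\alpha})$ is orthogonal and
      $\Lambda = \mathrm{diag}(\lambda)$ diagonal
   5. finally, define $x_{i\alpha} = u_{i\alpha}\sqrt{\lambda_\alpha}/\sqrt{f_i}$, which is the MDS coordinate of node $i$ in dimension
      $\alpha$. By construction, $\sum_\alpha (x_{i\alpha} - x_{j\alpha})^2 = D_{ij}$. Also, the total dispersion or inertia of the
      weighted configuration $(\mathbf{f}, \mathbf{D})$ reads $\Delta = \frac{1}{2}\sum_{i,j=1}^{n} f_i f_j D_{ij} = \mathrm{tr}(\mathbf{K}) = \sum_{\alpha=1}^{n-1} \lambda_\alpha$.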
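
The five steps above translate almost line by line into code. The sketch below is a minimal Python illustration, assuming strictly positive weights and a squared Euclidean input matrix; the function name `weighted_mds` and the toy configuration are assumptions made for the example.

```python
import numpy as np

def weighted_mds(f, D):
    """Weighted MDS of the configuration (f, D); returns K, eigenvalues, coordinates."""
    f = np.asarray(f, dtype=float)
    D = np.asarray(D, dtype=float)
    n = len(f)
    H = np.eye(n) - np.outer(np.ones(n), f)      # step 1: weighted centering
    B = -0.5 * H @ D @ H.T                       # step 2: scalar products
    sqrt_f = np.sqrt(f)
    K = sqrt_f[:, None] * B * sqrt_f[None, :]    # step 3: kernel (1)
    K = 0.5 * (K + K.T)                          # remove rounding asymmetry
    lam, U = np.linalg.eigh(K)                   # step 4: spectral decomposition
    order = np.argsort(lam)[::-1]                # decreasing eigenvalues
    lam, U = lam[order], U[:, order]
    lam = np.clip(lam, 0.0, None)                # clip tiny negative rounding errors
    X = U * np.sqrt(lam) / sqrt_f[:, None]       # step 5: x_ia = u_ia sqrt(lam_a) / sqrt(f_i)
    return K, lam, X

# Toy configuration: three weighted points in the plane (illustrative data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
f_toy = np.array([0.5, 0.3, 0.2])
diff = points[:, None, :] - points[None, :, :]
D_toy = (diff ** 2).sum(axis=2)

K_toy, lam_toy, X_toy = weighted_mds(f_toy, D_toy)
D_back = ((X_toy[:, None, :] - X_toy[None, :, :]) ** 2).sum(axis=2)
inertia = 0.5 * f_toy @ D_toy @ f_toy
```

Here `D_back` reproduces `D_toy` and `inertia` equals the trace of `K_toy`, in line with the two identities stated in step 5.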




3. The generalized weighted RV coefficient
At this stage, the book and the movie networks of characters have been expressed as commute-time
or diffusive kernels (1), namely $\mathbf{K}_A$ and $\mathbf{K}_B$. Each kernel defines a weighted configuration
$(\mathbf{f}_A, \mathbf{D}_A)$, respectively $(\mathbf{f}_B, \mathbf{D}_B)$, and conversely. The weighted RV coefficient between both
configurations is defined as [1]

$$\mathrm{RV} = \mathrm{RV}_{AB} = \frac{\mathrm{CV}_{AB}}{\sqrt{\mathrm{CV}_{AA}\,\mathrm{CV}_{BB}}} \qquad\text{where}\qquad \mathrm{CV}_{AB} = \mathrm{trace}(\mathbf{K}_A \mathbf{K}_B) \tag{2}$$

and constitutes a straightforward generalization of the original, unweighted RV coefficient [2],
providing the cosine similarity between the vectorized matrices $\mathbf{K}_A$ and $\mathbf{K}_B$ (figure 1).
  In particular, RV𝐴𝐵 ≥ 0 (since K𝐴 and K𝐵 are positive semi-definite: this condition is
equivalent to the squared Euclidean nature of D𝐴 and D𝐵 ), RV𝐴𝐵 ≤ 1 (by the Cauchy-Schwarz
inequality) and RV𝐴𝐴 = 1.
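
Definition (2) and the properties just listed can be checked on a small example. The following Python sketch assumes two random toy configurations sharing uniform weights; the helper names (`kernel`, `squared_distances`, `rv_coefficient`) and the data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def squared_distances(X):
    """Squared Euclidean distances between the rows of X."""
    G = X @ X.T
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2.0 * G

def kernel(f, D):
    """Kernel (1): K = sqrt(Pi) B sqrt(Pi), with B = -1/2 H D H^T."""
    n = len(f)
    H = np.eye(n) - np.outer(np.ones(n), f)
    B = -0.5 * H @ D @ H.T
    s = np.sqrt(f)
    return s[:, None] * B * s[None, :]

def rv_coefficient(K_A, K_B):
    """Weighted RV coefficient (2) between two kernels."""
    cv_ab = np.trace(K_A @ K_B)
    return cv_ab / np.sqrt(np.trace(K_A @ K_A) * np.trace(K_B @ K_B))

rng = np.random.default_rng(1)
n = 5
f = np.full(n, 1.0 / n)                          # common weights (toy choice)
K_A = kernel(f, squared_distances(rng.normal(size=(n, 2))))
K_B = kernel(f, squared_distances(rng.normal(size=(n, 2))))

rv_ab = rv_coefficient(K_A, K_B)
rv_aa = rv_coefficient(K_A, K_A)
# Same quantity, written as a cosine between the vectorized kernels
cosine = K_A.ravel() @ K_B.ravel() / (np.linalg.norm(K_A) * np.linalg.norm(K_B))
```

The value `rv_aa` equals 1, and `rv_ab` coincides with `cosine` because both kernels are symmetric.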







Figure 1: The weighted RV coefficient measures the similarity between two weighted configurations
$(\mathbf{f}, \mathbf{D}_A)$ (left) and $(\mathbf{f}, \mathbf{D}_B)$ (right) embedded in $\mathbb{R}^{n-1}$. Here the $n$ objects (characters) are endowed with
the same weights $\mathbf{f}$ in both configurations, and the object coordinates are obtained from weighted MDS
applied to the squared Euclidean dissimilarities $\mathbf{D}_A$, respectively $\mathbf{D}_B$.




  The null distribution of the weighted RV coefficient, i.e. assuming no relationships between
the two configurations, and in particular its statistical significance, have been extensively
investigated during the last decades [see e.g. 6, 1, and references therein].
  The crucial issue here is that the character weights f𝐴 and f𝐵 differ in the two character
networks, whence the naming generalized weighted RV coefficient for the same quantity (2),
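where

$$\begin{aligned}
\mathbf{K}_A &= \sqrt{\Pi_A}\,\mathbf{B}_A\,\sqrt{\Pi_A} & \Pi_A &= \mathrm{diag}(\mathbf{f}_A) \\
\mathbf{B}_A &= -\tfrac{1}{2}\,\mathbf{H}_A \mathbf{D}_A \mathbf{H}_A^\top & \mathbf{H}_A &= \mathbf{I}_n - \mathbf{1}_n \mathbf{f}_A^\top \\
\mathbf{K}_B &= \sqrt{\Pi_B}\,\mathbf{B}_B\,\sqrt{\Pi_B} & \Pi_B &= \mathrm{diag}(\mathbf{f}_B) \\
\mathbf{B}_B &= -\tfrac{1}{2}\,\mathbf{H}_B \mathbf{D}_B \mathbf{H}_B^\top & \mathbf{H}_B &= \mathbf{I}_n - \mathbf{1}_n \mathbf{f}_B^\top
\end{aligned}$$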
   This circumstance generates new challenging issues, whose investigation was one of the
motivations for embarking on the present piece of research.
   Note that well-established statistical procedures for testing the statistical significance
of the weighted coefficient RVℎ are available [see e.g. 6, 1, and references therein]. However,
testing the generalized weighted RV coefficient is, at the present time, a completely open issue.

3.1. Decomposition of the generalized weighted RV coefficient
As mentioned, the relative weights f𝐴 and f𝐵 (set to the uniform weights 1/𝑛 in most applica-
tions of Multivariate Analysis) may differ to a spectacular extent: their supports supp(f𝐴 ) and
supp(f𝐵 ) do not even coincide in general, since version 𝐴 may contain characters absent in
version 𝐵, and vice-versa.






  Define the compromise weight $\mathbf{h}$ as

$$h_i = \frac{\sqrt{f_i^A f_i^B}}{Z} \qquad\text{where}\qquad Z = \sum_j \sqrt{f_j^A f_j^B} \in [0, 1]\,. \tag{3}$$

This choice, initially dictated by formal considerations, since it permits transforming the
square roots in (1) into a tractable expression, turns out to be conceptually convenient and
interpretable as well: $Z$ is a measure of similarity between the weights (so that $-2\ln Z$ measures their dissimilarity), appearing in identity (6) below.
Also, $h_i = 0$ unless character $i$ appears in both versions (figure 3). A little algebra shows
that the numerator of the similarity index (2) can be expressed as

$$\frac{\mathrm{CV}_{AB}}{Z^2} = \mathrm{trace}(\mathbf{K}_{hA}\mathbf{K}_{hB}) + \kappa_{AB} \tag{4}$$
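where $\mathbf{K}_{hA}$ is the kernel associated to configuration $(\mathbf{h}, \mathbf{D}_A)$ and $\mathbf{K}_{hB}$ is the kernel associated
to configuration $(\mathbf{h}, \mathbf{D}_B)$. Also,

$$\kappa_{AB} = D^{A}_{hf_A}\, D^{B}_{hf_B} + 2 \sum_i h_i\, \ell_i^A\, \ell_i^B \tag{5}$$

where $D^{A}_{hf_A}$ is the squared Euclidean distance between the gravity centers of $(\mathbf{h}, \mathbf{D}_A)$ and
$(\mathbf{f}_A, \mathbf{D}_A)$. The quantity $D^{B}_{hf_B}$ is defined analogously. Naturally, the gravity centers of $(\mathbf{f}_A, \mathbf{D}_A)$
and $(\mathbf{h}, \mathbf{D}_A)$ generally differ, as do the gravity centers of $(\mathbf{f}_B, \mathbf{D}_B)$ and $(\mathbf{h}, \mathbf{D}_B)$, but the
differences are extremely small in the case study (see Figure 6). The second component in (5)
involves a weighted covariance between the $\mathbf{h}$-centered vectors $\ell^A = \mathbf{B}_{hA}\mathbf{f}_A$ and $\ell^B = \mathbf{B}_{hB}\mathbf{f}_B$,
where

$$\mathbf{B}_{hA} = -\tfrac{1}{2}\,\mathbf{H}_h \mathbf{D}_A \mathbf{H}_h^\top \qquad\qquad \mathbf{B}_{hB} = -\tfrac{1}{2}\,\mathbf{H}_h \mathbf{D}_B \mathbf{H}_h^\top$$

with $\mathbf{H}_h = \mathbf{I}_n - \mathbf{1}_n \mathbf{h}^\top$ defined as in section 2.2. Like the first, this second component is zero if the compromise centroid coincides with the original
centroid in configuration $A$, or $B$, or both. In short, the term $\kappa_{AB}$ in (5), which can be negative
(as here in the two distance variants), represents a correction due to the non-coincidence of the
$\mathbf{f}_A$- and $\mathbf{h}$-centroids in configuration $A$ (respectively the $\mathbf{f}_B$- and $\mathbf{h}$-centroids in configuration
$B$).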
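
Identities (3), (4) and (5) can be verified numerically. The Python sketch below does so on random toy data, assuming for simplicity that all characters appear in both versions (strictly positive weights); the helper names and the data are illustrative assumptions. The squared distance between the $\mathbf{f}_A$- and $\mathbf{h}$-centroids is computed as $\mathbf{f}_A^\top \mathbf{B}_{hA}\mathbf{f}_A$, which follows from the fact that $\mathbf{B}_{hA}$ contains the scalar products of the $\mathbf{h}$-centered coordinates.

```python
import numpy as np

def squared_distances(X):
    """Squared Euclidean distances between the rows of X."""
    G = X @ X.T
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2.0 * G

def scalar_products(w, D):
    """B = -1/2 H D H^T with the weighted centering matrix H = I - 1 w^T."""
    n = len(w)
    H = np.eye(n) - np.outer(np.ones(n), w)
    return -0.5 * H @ D @ H.T

def kernel(w, D):
    """K = sqrt(Pi) B sqrt(Pi) with Pi = diag(w)."""
    s = np.sqrt(w)
    return s[:, None] * scalar_products(w, D) * s[None, :]

# Random toy data: n characters, two versions with different weights.
rng = np.random.default_rng(0)
n = 6
D_A = squared_distances(rng.normal(size=(n, 3)))
D_B = squared_distances(rng.normal(size=(n, 3)))
f_A = rng.random(n); f_A /= f_A.sum()
f_B = rng.random(n); f_B /= f_B.sum()

# Compromise weights (3)
Z = np.sqrt(f_A * f_B).sum()
h = np.sqrt(f_A * f_B) / Z

# Correction term (5)
B_hA, B_hB = scalar_products(h, D_A), scalar_products(h, D_B)
l_A, l_B = B_hA @ f_A, B_hB @ f_B
D_hfA = f_A @ l_A            # squared distance between f_A- and h-centroids
D_hfB = f_B @ l_B            # squared distance between f_B- and h-centroids
kappa = D_hfA * D_hfB + 2.0 * np.sum(h * l_A * l_B)

# Both sides of identity (4)
lhs = np.trace(kernel(f_A, D_A) @ kernel(f_B, D_B)) / Z**2
rhs = np.trace(kernel(h, D_A) @ kernel(h, D_B)) + kappa
```

With these definitions `lhs` and `rhs` agree up to rounding error, whatever the weights, as long as they are positive and sum to one.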

3.2. An exact additive decomposition formula
A similarity coefficient such as the generalized weighted coefficient RV ∈ [0, 1] can be simply
converted into a dissimilarity coefficient 𝑑 ∈ [0, ∞) by 𝑑 = − ln RV. Applying the transforma-
tion to (2), taking into account the previous definitions and performing direct, down-to-earth
algebraic operations finally yields the following exact decomposition for the dissimilarity
between character networks 𝐴 and 𝐵:
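$$d_{AB} = \underbrace{-\ln \mathrm{RV}}_{\text{composite dissimilarity}}
= \underbrace{-\ln \mathrm{RV}_h}_{\text{adjusted dissimilarity } d^{h}_{AB}}
\;\underbrace{-\,2\ln Z}_{\text{dissimilarity between character weights}}
\;\underbrace{-\,\tfrac{1}{2}\ln \Gamma_A}_{\text{relative dispersion, book}}
\;\underbrace{-\,\tfrac{1}{2}\ln \Gamma_B}_{\text{relative dispersion, movie}}
\;\underbrace{-\,\ln(1+\epsilon)}_{\text{centroid correction}} \tag{6}$$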






where
    ∙ the "compromise" RV coefficient, $\mathrm{RV}_h$, is defined as

$$\mathrm{RV}_h = \frac{\mathrm{trace}(\mathbf{K}_{hA}\mathbf{K}_{hB})}{\sqrt{\mathrm{trace}(\mathbf{K}_{hA}^2)\,\mathrm{trace}(\mathbf{K}_{hB}^2)}} \in [0, 1] \tag{7}$$

      and measures the similarity between the dissimilarities $\mathbf{D}_A$ and $\mathbf{D}_B$ in the common
      compromise weighting $\mathbf{h}$.
    ∙ $Z \in [0, 1]$ in (3) is a measure of similarity between the weights $\mathbf{f}_A$ and $\mathbf{f}_B$, taking on its
      maximum value $Z = 1$ iff $\mathbf{f}_A = \mathbf{f}_B$, and its minimum value $Z = 0$ iff the two versions
      have no character in common.
    ∙ $\Gamma_A = \mathrm{trace}(\mathbf{K}_{hA}^2)/\mathrm{trace}(\mathbf{K}_A^2)$ is a measure of the ratio of the (quartic) dispersion of configuration
      $\mathbf{D}_A$ in the compromise weighting $\mathbf{h}$ to the dispersion of $\mathbf{D}_A$ in the original weighting
      $\mathbf{f}_A$ ($\Gamma_B$ is defined analogously).
      $-\frac{1}{2}\ln \Gamma_A > 0$ essentially means that the average contrast between characters (as
      expressed by $\mathbf{D}_A$) is stronger in the original version $\mathbf{f}_A$ than in the compromise version $\mathbf{h}$,
      which is in particular likely to occur when "eccentric" characters in version $A$ occur less
      often in version $B$.
    ∙ the quantity

$$\epsilon = \frac{\kappa_{AB}}{\mathrm{RV}_h \sqrt{\Gamma_A \Gamma_B\, \mathrm{trace}(\mathbf{K}_A^2)\,\mathrm{trace}(\mathbf{K}_B^2)}}$$

      is a normalized measure of the centroid correction occurring in (4). It reflects a "polarization
      effect" due to the centroid changes $\bar{\mathbf{x}}_{f_A} \to \bar{\mathbf{x}}_h$ and $\bar{\mathbf{x}}_{f_B} \to \bar{\mathbf{x}}_h$, since the overall dispersions
      of $\mathbf{D}_A$ and $\mathbf{D}_B$ are bound to vary when the reference point is moved away from the centroid
      of the configuration. Its magnitude is expected to be small since the main common characters (i.e.
      those with large compromise weights $\mathbf{h}$) are precisely the most frequent in both versions
      $A$ and $B$.
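
The exact character of decomposition (6) can be checked numerically. The Python sketch below computes the five terms on random toy data and compares their sum with $-\ln \mathrm{RV}$; it assumes strictly positive weights in both versions, and the function names and data are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def squared_distances(X):
    """Squared Euclidean distances between the rows of X."""
    G = X @ X.T
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2.0 * G

def scalar_products(w, D):
    """B = -1/2 H D H^T with the weighted centering matrix H = I - 1 w^T."""
    n = len(w)
    H = np.eye(n) - np.outer(np.ones(n), w)
    return -0.5 * H @ D @ H.T

def kernel(w, D):
    """K = sqrt(Pi) B sqrt(Pi) with Pi = diag(w)."""
    s = np.sqrt(w)
    return s[:, None] * scalar_products(w, D) * s[None, :]

def tr(M, N):
    return np.trace(M @ N)

def decomposition(f_A, D_A, f_B, D_B):
    """Return -ln RV and the five terms of (6), in the order of the formula."""
    Z = np.sqrt(f_A * f_B).sum()
    h = np.sqrt(f_A * f_B) / Z
    K_A, K_B = kernel(f_A, D_A), kernel(f_B, D_B)
    K_hA, K_hB = kernel(h, D_A), kernel(h, D_B)
    rv = tr(K_A, K_B) / np.sqrt(tr(K_A, K_A) * tr(K_B, K_B))        # (2)
    rv_h = tr(K_hA, K_hB) / np.sqrt(tr(K_hA, K_hA) * tr(K_hB, K_hB))  # (7)
    gamma_A = tr(K_hA, K_hA) / tr(K_A, K_A)
    gamma_B = tr(K_hB, K_hB) / tr(K_B, K_B)
    # Centroid correction kappa, following (5)
    l_A = scalar_products(h, D_A) @ f_A
    l_B = scalar_products(h, D_B) @ f_B
    kappa = (f_A @ l_A) * (f_B @ l_B) + 2.0 * np.sum(h * l_A * l_B)
    eps = kappa / (rv_h * np.sqrt(gamma_A * gamma_B * tr(K_A, K_A) * tr(K_B, K_B)))
    terms = np.array([-np.log(rv_h), -2.0 * np.log(Z),
                      -0.5 * np.log(gamma_A), -0.5 * np.log(gamma_B),
                      -np.log1p(eps)])
    return -np.log(rv), terms

rng = np.random.default_rng(0)
n = 6
D_A = squared_distances(rng.normal(size=(n, 3)))
D_B = squared_distances(rng.normal(size=(n, 3)))
f_A = rng.random(n); f_A /= f_A.sum()
f_B = rng.random(n); f_B /= f_B.sum()

d_AB, terms = decomposition(f_A, D_A, f_B, D_B)
```

Individual terms may be negative (for instance when $\Gamma_A > 1$), but their sum `terms.sum()` reproduces `d_AB` up to rounding error.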


4. The case study
The Lion, the Witch and the Wardrobe was the second of the seven novels of The Chronicles of
Narnia, written by C. S. Lewis in 1950, and adapted into a film directed by A. Adamson released
in 2005 (figure 2).
   After semi-manual annotation of all named entities throughout the book and the movie script
with the module charnetto [7], and their subsequent gathering into groups of aliases, a list of 37 distinct characters
was identified:

    ∙ 16 characters are common to the book and the movie
    ∙ 8 characters occur in the book only
    ∙ 13 characters occur in the movie only.








Figure 2: The two works under study: the book (A) and the movie (B)




Figure 3: Book character weights $\mathbf{f}_A$ (left), movie character weights $\mathbf{f}_B$ (middle) and compromise
character weights $\mathbf{h}$ (right), with $h_i = \sqrt{f_i^A f_i^B}/Z$. Characters appearing in both versions are represented
by grey bars, and otherwise by white bars.




   For each work, we defined the edge weights as the cross-count matrix $c_{ij}$ = "number of
co-occurrences of characters $i$ and $j$ within a window of 5 paragraphs" (each paragraph being
delimited by a line break), with $c_{ii} = 0$ (see figure 5). Similarly, the character weights were, for
a given work, simply defined as $f_i = c_{i\bullet}/c_{\bullet\bullet}$. Figure 4 depicts the corresponding networks.
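
The exact windowing convention is not detailed in the text. The Python sketch below shows one possible reading, stated here as an assumption: a window of 5 consecutive paragraphs slides over the text one paragraph at a time, and each pair of distinct characters present in a window adds one co-occurrence. The function names and the toy data are hypothetical.

```python
import numpy as np

def cross_counts(paragraphs, characters, window=5):
    """Cross-count matrix C under a sliding-window convention (assumption).

    paragraphs: list of sets, the characters detected in each paragraph.
    characters: ordered list of character names defining rows and columns.
    """
    index = {name: k for k, name in enumerate(characters)}
    n = len(characters)
    C = np.zeros((n, n))
    n_windows = max(len(paragraphs) - window + 1, 1)
    for start in range(n_windows):
        present = set().union(*paragraphs[start:start + window])
        ids = [index[name] for name in present if name in index]
        for a in ids:
            for b in ids:
                if a != b:               # keeps c_ii = 0
                    C[a, b] += 1
    return C

def character_weights(C):
    """Relative weights f_i = c_i. / c_.."""
    return C.sum(axis=1) / C.sum()

# Toy text of three paragraphs and a window of 2 paragraphs (illustrative).
names = ["Lucy", "Edmund", "Aslan"]
toy_paragraphs = [{"Lucy"}, {"Edmund"}, {"Lucy", "Aslan"}]
C_toy = cross_counts(toy_paragraphs, names, window=2)
f_toy = character_weights(C_toy)
```

In this toy example Lucy and Edmund co-occur in both windows, while Aslan co-occurs with each of them in the second window only, which gives the weights 3/8, 3/8 and 2/8.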
   The cross-count matrices $\mathbf{C}$ make it possible to compute commute-time distances (section 2.1.1) and
diffusive distances (section 2.1.2). Weighted MDS (section 2.2) in turn yields character
coordinates, as depicted in figure 6.
   In the present study, the centroids of configurations $(\mathbf{f}_A, \mathbf{D}_A)$ and $(\mathbf{f}_B, \mathbf{D}_B)$ are located at
the origin by construction, while the first coordinates of the centroids of $(\mathbf{h}, \mathbf{D}_A)$ and $(\mathbf{h}, \mathbf{D}_B)$
are $(\bar{x}^A_h, \bar{y}^A_h) = (0.002, 0.004)$, respectively $(\bar{x}^B_h, \bar{y}^B_h) = (-0.0007, -0.005)$, and fairly close to
the origin: $D^{A}_{hf_A} = 5.6 \cdot 10^{-5}$, respectively $D^{B}_{hf_B} = 4.0 \cdot 10^{-5}$. As a consequence, the terms
$\kappa_{AB}$ in (5) and $\epsilon$ in (6) are small.








Figure 4: Character networks of the book (left) and movie (right). Edge widths reflect the co-occurrences
𝑐𝑖𝑗 between nodes, and name sizes the corresponding degree 𝑐𝑖∙ .




Figure 5: Symmetric cross-count matrix C = (𝑐𝑖𝑗 ) for the book (column categories are identical to row
categories)




  The generalized coefficient RV defined in (2) (with differing weights) and the compromise
coefficient RVℎ defined in (7) turn out to be

                       RV = 0.113             RVℎ = 0.391         (diffusive distance)


                     RV = 0.531              RVℎ = 0.611      (commute-time distance)

In both cases, the magnitude of the term 𝜅𝐴𝐵 in (4) is negligible in comparison to
trace(Kℎ𝐴 Kℎ𝐵 ). Also, (6) reads here (in order)

                     2.1829 = 0.9385 + 0.2323 + 0.4349 + 0.5762 + 0.0011



[Figure 6, two MDS scatter plots of the characters. Left panel (book): factor 1, explained inertia = 38.4%; factor 2, explained inertia = 27.0%. Right panel (movie): factor 1, explained inertia = 38.8%; factor 2, explained inertia = 25.6%.]


Figure 6: First MDS coordinates of the characters of the book (left) and movie (right). They have been
extracted from the inter-character diffusive distances $D_A^{\mathrm{diff}}(t)$ for the book (left), respectively
$D_B^{\mathrm{diff}}(t)$ for the movie (right), with a diffusion time arbitrarily set to $t = 10$. Characters in black
appear in both works, characters in green in one work only. In each panel, the blue point depicts the
corresponding centroid obtained with the compromise distribution h (section 3.1).



for the diffusive distances, and
                                                                         0.6333 = 0.4917 + 0.2323 − 0.3431 + 0.2437 + 0.0087
for the commute-time distances.
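The arithmetic of the commute-time decomposition can be checked directly. The following minimal sketch only sums the five terms as printed above; it makes no claim about what each term of decomposition (6) represents, and the variable names are chosen here for illustration.

```python
# Terms of the commute-time decomposition, copied as printed above.
# Their meaning is given by decomposition (6), not by this snippet.
terms = [0.4917, 0.2323, -0.3431, 0.2437, 0.0087]
reported_total = 0.6333

total = sum(terms)
# The values are printed to four decimals, so allow a rounding tolerance.
consistent = abs(total - reported_total) < 5e-4
print(f"sum of terms = {total:.4f}, consistent with reported total: {consistent}")
```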


5. Conclusion
Representing the relations between characters of a work as a weighted Euclidean configuration
(f , D) arguably constitutes an instance of very distant reading, but not more distant than the
usual representation by a weighted network. In both cases, the underlying dyadic formalism
(i.e. based upon character pairs) could, and maybe should, be extended to a 𝑝-adic formalism,
taking into account the simultaneous co-occurrences of 𝑝 = 0, 1, 2, 3, . . . characters (cliques).
Also, the simple co-occurrence relation is in itself particularly rudimentary, yet surprisingly
effective, as attested by many applications in Data Analysis, Natural Language Processing and
Machine Learning.
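As an illustration of what a 𝑝-adic extension could start from, the following minimal sketch counts how often each group of 𝑝 characters appears together in the same textual unit. The function name `count_cooccurrences` and the three textual units are hypothetical and chosen for illustration only; they are not taken from the case study.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(units, p):
    """Count how often each group of p characters appears together in a unit."""
    counts = Counter()
    for characters in units:
        # sorted() gives every p-subset a canonical tuple key
        for group in combinations(sorted(set(characters)), p):
            counts[group] += 1
    return counts

# Hypothetical textual units (e.g. scenes or paragraphs), made up for illustration
units = [
    {"LUCY", "EDMUND", "ASLAN"},
    {"LUCY", "MR. TUMNUS"},
    {"LUCY", "EDMUND"},
]

pairs = count_cooccurrences(units, 2)    # usual dyadic case
triples = count_cooccurrences(units, 3)  # three characters at once
print(pairs)
print(triples)
```

With 𝑝 = 2 this reduces to the usual pair counts underlying a weighted network; larger 𝑝 retains information about groups that pair counts alone cannot distinguish.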
    On the one hand, we recognize that the mathematical requirements needed to appreciate
(or not) the present proposal may distress some enthusiasts of character networks. Also, a fully
convincing literary interpretation of the various terms in decomposition (6) has yet to be established.
Furthermore, obtaining a single index (such as RV = 0.113) is neither terribly enlightening
nor helpful. Comparing more than two character networks is more satisfactory, but multiple
versions of character networks are alas rare.
    On the other hand, quantifying the dissimilarity between two networks cannot ignore mathematical
issues, and the proposed formalism makes it possible to design a procedure which can be made
fully automatic, and yields dissimilarities which can be shown to be metric, namely such that
𝑑𝐴𝐵 ≤ 𝑑𝐴𝐶 + 𝑑𝐶𝐵 (triangle inequality) for three versions 𝐴, 𝐵 and 𝐶. Also, the exact decom-
position permits a detailed, systematically comparable, analysis of sources of (dis)similarities
between two character networks.
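The metric property stated above can be verified numerically on any table of pairwise dissimilarities between versions. The following minimal sketch assumes a nested-dictionary layout and uses made-up values for three hypothetical versions; it checks the triangle inequality only and is not the authors' procedure.

```python
from itertools import permutations

def satisfies_triangle_inequality(d, tol=1e-12):
    """Return True if d[a][b] <= d[a][c] + d[c][b] for all distinct a, b, c."""
    versions = list(d)
    return all(
        d[a][b] <= d[a][c] + d[c][b] + tol
        for a, b, c in permutations(versions, 3)
    )

# Hypothetical dissimilarities between three versions A, B and C (made-up values)
d = {
    "A": {"A": 0.0, "B": 0.5, "C": 0.4},
    "B": {"A": 0.5, "B": 0.0, "C": 0.3},
    "C": {"A": 0.4, "B": 0.3, "C": 0.0},
}
is_metric_like = satisfies_triangle_inequality(d)
print(is_metric_like)
```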
   More generally, the present formalism may help anchor the study of character
networks more firmly in mainstream Data Analysis, and draw attention to otherwise overlooked
phenomena: for instance, the weights similarity index 𝑍 can be related to the Chernoff information
occurring in the Neyman-Pearson statistical testing framework [see e.g. 8]; dealing with distinct
distributions on the same objects endowed with pair distances evokes Optimal Transportation
theory, together with the possible involvement of the earth mover’s distance [see e.g. 9, 10]
in comparing two character networks; finally, the quantities Γ𝐴 and Γ𝐵 , which can exceed
or fall below one, should be interpreted as indicators of the diversity loss entailed by the
disappearance of 𝐴-specific characters in the compromise weighting, or, to the contrary, as a
diversity gain reflecting distinct emphasis and specificity between two variants 𝐴 and 𝐵. But,
as demonstrated in the case study, such a behaviour turns out to depend on the choice of the
character dissimilarities D, whose suitability from a more literary perspective should certainly
be further investigated in future developments.
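For readers unfamiliar with the Chernoff information mentioned above, the following minimal sketch computes its textbook definition, $-\min_{0 \le \lambda \le 1} \log \sum_i f_i^{\lambda} g_i^{1-\lambda}$, for two weight distributions on the same set of characters. The precise relation to the index 𝑍 is not reproduced here; the grid search, the function name and the two weight vectors are assumptions made for illustration.

```python
import math

def chernoff_information(f, g, steps=1000):
    """Chernoff information between two discrete distributions, by grid search on lambda."""
    # Terms where either weight is zero contribute nothing for 0 < lambda < 1,
    # so only the common support is kept (a character absent from one work drops out).
    common = [(fi, gi) for fi, gi in zip(f, g) if fi > 0 and gi > 0]
    if not common:
        return math.inf  # disjoint supports
    best = min(
        math.log(sum(fi ** lam * gi ** (1 - lam) for fi, gi in common))
        for lam in (k / steps for k in range(steps + 1))
    )
    return -best

# Hypothetical character weights in two versions (made-up values);
# the last character is absent from the first version.
f = [0.5, 0.3, 0.2, 0.0]
g = [0.4, 0.2, 0.2, 0.2]
print(chernoff_information(f, g))
```

The quantity vanishes when the two distributions coincide and grows as they diverge, which is why it is a natural point of comparison for a similarity index on weights.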


References
 [1] F. Bavaud, Exact first moments of the RV coefficient by invariant orthogonal integration,
     Journal of Multivariate Analysis (2023) 105227.
 [2] P. Robert, Y. Escoufier, A unifying tool for linear multivariate statistical methods: the
     RV-coefficient, Journal of the Royal Statistical Society: Series C (Applied Statistics) 25
     (1976) 257–265.
 [3] J. G. Kemeny, J. L. Snell, Finite Markov chains: with a new appendix "Generalization of a
     fundamental matrix", Springer, 1983.
 [4] M. Saerens, F. Fouss, L. Yen, P. Dupont, The principal components analysis of a graph,
     and its relationships to spectral clustering, in: European conference on machine learning,
     Springer, 2004, pp. 371–383.
 [5] F. Bavaud, Spatial weights: Constructing weight-compatible exchange matrices from
     proximity matrices, in: M. Duckham, E. Pebesma, K. Stewart, A. U. Frank (Eds.), Geographic
     Information Science, Springer International Publishing, Cham, 2014, pp. 81–96.
 [6] J. Josse, J. Pagès, F. Husson, Testing the significance of the RV coefficient, Computational
     Statistics & Data Analysis 53 (2008) 82–91.
 [7] C. Métrailler, charnetto: a module designed to create an automated character network
     based on a book or a movie script, 2021. https://pypi.org/project/charnetto/.
 [8] T. M. Cover, Elements of information theory, John Wiley & Sons, 1999.
 [9] C. Villani, Optimal transport: old and new, volume 338, Springer, 2009.
[10] M. Cuturi, D. Avis, Ground metric learning, The Journal of Machine Learning Research 15
     (2014) 533–564.



