=Paper=
{{Paper
|id=Vol-3602/paper3
|storemode=property
|title=A (Dis)similarity Index for Comparing Two Character Networks Based on the Same Story
|pdfUrl=https://ceur-ws.org/Vol-3602/paper3.pdf
|volume=Vol-3602
|authors=François Bavaud,Coline Métrailler
|dblpUrl=https://dblp.org/rec/conf/comhum/BavaudM22
}}
==A (Dis)similarity Index for Comparing Two Character Networks Based on the Same Story==
François Bavaud (1, 2), Coline Métrailler (1)

(1) Faculty of Arts, Department of Language and Information Sciences, University of Lausanne, bâtiment Anthropole, 1015 Lausanne, Switzerland

(2) Faculty of Geosciences and Environment, Institute of Geography and Sustainability, University of Lausanne, bâtiment Géopolis, 1015 Lausanne, Switzerland

===Abstract===
Comparing networks is always a complicated matter, whose effective implementation strongly depends on the amount of information the networks share, in particular on whether nodes, edges, weights, etc. are identical or not. In the case of character networks and adaptations (from book to movie, from movie to theater, and so on), the formal challenge proves stimulating: some characters will be mapped from one work to the other, some will have no correspondence, and their weights, measuring their relative occurrence, are bound to differ. This formal contribution, rooted in Multivariate Data Analysis, proposes a presumably novel similarity index, the generalized weighted RV coefficient, which takes into account both the difference in character weights (nodes) and in character interactions (edges). This approach first requires transforming the character networks into weighted squared Euclidean configurations. To illustrate the proposal and the results, we then compare a novel by C. S. Lewis, part of the series The Chronicles of Narnia, with the script of its film adaptation.

===1. Introduction===
Networks of fictional characters often exist in two or more versions. For instance (section 4), network 𝐴 is built from a novel, and network 𝐵 from a movie adaptation. Besides the main characters common to both versions, some characters belong to a single version only. Also, the importance of common characters (as measured, e.g., by their relative occurrence) is bound to vary between the two versions, as is the strength of their mutual relations (as e.g.
measured by their relative co-occurrence). Hence, the networks 𝐴 and 𝐵 differ both in character weights (nodes) and in interaction weights (edges).

Our contribution proposes the definition of a single index measuring the overall similarity between 𝐴 and 𝐵. This index, denoted RV, constitutes an innovative generalization, involving two distinct sets of object weights, of the weighted RV-coefficient [1], which is itself a generalization of the original, unweighted RV-coefficient [2] (where R refers to "correlation" and V to "vector"). In particular, RV ∈ [0, 1], with RV = 1 iff 𝐴 and 𝐵 are identical (i.e., same character weights and same dissimilarities between characters), and RV = 0 iff 𝐴 and 𝐵 have no character in common. This similarity coefficient can in turn be transformed into a dissimilarity coefficient, which can be additively decomposed into five components.

COMHUM 2022: Workshop on Computational Methods in the Humanities, June 09–10, 2022, Lausanne, Switzerland. Contact: fbavaud@unil.ch (F. Bavaud); coline.metrailler@unil.ch (C. Métrailler). ORCID: 0000-0002-4565-0715 (F. Bavaud); 0000-0002-3196-481X (C. Métrailler). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 33–42.

The formalism, presented in section 2, is rooted in Weighted Data Analysis. It involves three major steps:

1. Transforming a weighted network into a weighted Euclidean configuration (section 2.1). This step is chiefly dictated by the formalism, which requires squared Euclidean dissimilarities, but it also yields, as a byproduct, a visualization of the character network (by weighted multidimensional scaling) that is of interest in itself.
2. Transforming the weighted Euclidean configuration into a kernel, whose eigendecomposition makes it possible to visualize the network nodes (section 2.2).
3. Computing the generalized RV coefficient (beginning of section 3), which assesses the similarity between two networks whose node weights may differ, and its exact decomposition into five terms (sections 3.1 and 3.2).

===2. Visualization of character networks: a few "reminders"===
We formalize character networks as weighted networks (f, C), where f is the vector of the 𝑛 character weights, obeying <math>f_i \ge 0</math> and <math>\sum_{i=1}^{n} f_i = 1</math>. The 𝑛 × 𝑛 matrix of edge weights <math>\mathbf{C} = (c_{ij})</math> is non-negative, and quantifies the importance of edge 𝑖𝑗, reflecting some kind of affinity between the characters 𝑖 and 𝑗, such as their co-occurrence in the present study (section 4). It can be symmetric (as for C = A, where A is a binary adjacency matrix in an undirected network), or not (as for asymmetric social relationships), in which case the network is directed.

====2.1. Extracting Euclidean dissimilarities from a weighted network====
Network visualization is a boundless research topic, even when restricted, as done here, to the Euclidean embedding of networks. Specifically, one seeks to extract from the network data (f, C) a matrix <math>\mathbf{D} = (D_{ij})</math> of squared Euclidean dissimilarities between nodes, that is, of the form <math>D_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|^2</math>, where <math>\mathbf{x}_i</math> is the representative vector of node 𝑖. Only two proposals, among many other possibilities (extracting squared Euclidean dissimilarities from a weighted network is a topic in itself), are considered in this contribution: commute-time distances (section 2.1.1) and diffusive distances (section 2.1.2).

=====2.1.1. Commute-time distances=====
By construction, the square matrix <math>\mathbf{P} = (p_{ij})</math> with components <math>p_{ij} = c_{ij}/c_{i\bullet}</math> (where <math>c_{i\bullet} = \sum_{j=1}^{n} c_{ij}</math>) is non-negative, with <math>p_{i\bullet} = 1</math>: it therefore constitutes the transition matrix of a Markov chain, defining a random walk on the network.
The commute time <math>D^{\mathrm{com}}_{ij}</math> is the average time needed to go from 𝑖 to 𝑗 and then back to 𝑖 [see e.g. 3]. The matrix <math>\mathbf{D}^{\mathrm{com}} = (D^{\mathrm{com}}_{ij})</math> of commute times is well known to be squared Euclidean [see e.g. 4], irrespective of the other properties of C, provided that C is irreducible, that is, provided that any two states can be directly or indirectly connected by the random walk.

=====2.1.2. Diffusive distances=====
One considers the edge weights C as generating a so-called instantaneous jump process, which makes it possible to navigate from node 𝑖 to node 𝑗 with a rate given by (minus) the components of the weighted Laplacian

<math>\mathbf{L} = \Pi^{-1/2}\,[\mathrm{diag}(\mathbf{C}\mathbf{1}_n) - \mathbf{C}]\,\Pi^{-1/2}</math>

where Π = diag(f), and C must be taken as symmetric, that is, replaced if necessary by <math>\tfrac{1}{2}(\mathbf{C} + \mathbf{C}^\top)</math>. Then choose a diffusion time 𝑡 > 0, and compute the joint probability <math>e_{ij}(t)</math> to be initially in 𝑖 and in 𝑗 at time 𝑡 (or the other way round) as

<math>\mathbf{E}(t) = (e_{ij}(t)) = \Pi^{1/2} \exp(-t\,\mathbf{L})\,\Pi^{1/2}</math>

The squared Euclidean diffusive distances <math>\mathbf{D}^{\mathrm{diff}}(t) = (D^{\mathrm{diff}}_{ij}(t))</math> are finally obtained as [see e.g. 5]

<math>D^{\mathrm{diff}}_{ij}(t) = \frac{e_{ii}(t)}{f_i^2} + \frac{e_{jj}(t)}{f_j^2} - 2\,\frac{e_{ij}(t)}{f_i f_j}</math>

====2.2. Visualizing a weighted Euclidean configuration by weighted MDS====
Weighted multidimensional scaling constitutes the canonical procedure for the low-dimensional visualization of a weighted configuration (f, D) (see figure 1):

1. define Π = diag(f), as well as the weighted centering matrix <math>\mathbf{H} = \mathbf{I}_n - \mathbf{1}_n \mathbf{f}^\top</math>
2. obtain by double centering the matrix of scalar products <math>\mathbf{B} = -\tfrac{1}{2}\,\mathbf{H}\,\mathbf{D}\,\mathbf{H}^\top</math>
3. define the matrix of weighted scalar products, or kernel, K as <math>\mathbf{K} = \sqrt{\Pi}\,\mathbf{B}\,\sqrt{\Pi}</math>, that is <math>K_{ij} = \sqrt{f_i f_j}\,B_{ij} \qquad (1)</math>
4. perform the spectral decomposition <math>\mathbf{K} = \mathbf{U}\Lambda\mathbf{U}^\top</math>, where <math>\mathbf{U} = (u_{i\alpha})</math> is orthogonal and Λ = diag(λ) is diagonal
5. finally, define <math>x_{i\alpha} = u_{i\alpha}\sqrt{\lambda_\alpha}/\sqrt{f_i}</math>, which is the MDS coordinate of node 𝑖 in dimension 𝛼.

By construction, <math>\sum_\alpha (x_{i\alpha} - x_{j\alpha})^2 = D_{ij}</math>.
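The diffusive distances of section 2.1.2 and the five MDS steps above can be sketched end to end as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the 4 × 4 matrix `C` is invented toy data, `C` is normalized to total mass one and the diffusion time is set to `t = 1.0` purely for numerical convenience, and the function names `diffusive_distances` and `weighted_mds` are hypothetical.

```python
import numpy as np

def diffusive_distances(C, f, t):
    """Squared Euclidean diffusive distances for edge weights C and weights f > 0."""
    C = 0.5 * (C + C.T)                                # symmetrize if necessary
    s = 1.0 / np.sqrt(f)
    L = s[:, None] * (np.diag(C.sum(axis=1)) - C) * s[None, :]   # weighted Laplacian
    lam, U = np.linalg.eigh(L)                         # L is symmetric
    exp_tL = (U * np.exp(-t * lam)) @ U.T              # exp(-t L) from the spectrum
    r = np.sqrt(f)
    E = r[:, None] * exp_tL * r[None, :]               # joint probabilities e_ij(t)
    G = E / np.outer(f, f)                             # e_ij(t) / (f_i f_j)
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2.0 * G

def weighted_mds(f, D):
    """Weighted MDS of the configuration (f, D): coordinates, eigenvalues, kernel."""
    n = len(f)
    H = np.eye(n) - np.outer(np.ones(n), f)            # step 1: H = I - 1 f^T
    B = -0.5 * H @ D @ H.T                             # step 2: scalar products
    r = np.sqrt(f)
    K = r[:, None] * B * r[None, :]                    # step 3: kernel (1)
    K = 0.5 * (K + K.T)                                # remove rounding asymmetry
    lam, U = np.linalg.eigh(K)                         # step 4: spectral decomposition
    order = np.argsort(lam)[::-1]                      # largest eigenvalues first
    lam, U = np.clip(lam[order], 0.0, None), U[:, order]
    X = U * np.sqrt(lam) / r[:, None]                  # step 5: coordinates x_{i alpha}
    return X, lam, K

# Toy symmetric co-occurrence counts (invented for illustration only).
C = np.array([[0, 3, 1, 0],
              [3, 0, 2, 1],
              [1, 2, 0, 4],
              [0, 1, 4, 0]], dtype=float)
C /= C.sum()                                           # assumption: total mass one
f = C.sum(axis=1)                                      # node weights f_i = c_i. / c_..
D = diffusive_distances(C, f, t=1.0)
X, lam, K = weighted_mds(f, D)

# Squared distances recomputed from the MDS coordinates, and the inertia.
D_hat = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
inertia = 0.5 * f @ D @ f
```

Here `D_hat` should reproduce `D` up to rounding, and `inertia` should agree with both `np.trace(K)` and `lam.sum()`. The matrix exponential is computed from the eigendecomposition of the symmetric Laplacian, which avoids any dependency beyond NumPy.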
Also, the total dispersion or inertia of the weighted configuration (f, D) reads

<math>\Delta = \frac{1}{2}\sum_{i,j=1}^{n} f_i f_j D_{ij} = \operatorname{trace}(\mathbf{K}) = \sum_{\alpha=1}^{n-1} \lambda_\alpha</math>

===3. The generalized weighted RV coefficient===
At this stage, the book and the movie networks of characters have been expressed as commute-time or diffusive kernels (1), namely <math>\mathbf{K}_A</math> and <math>\mathbf{K}_B</math>. Each kernel defines a weighted configuration <math>(\mathbf{f}_A, \mathbf{D}_A)</math>, respectively <math>(\mathbf{f}_B, \mathbf{D}_B)</math>, and conversely. The weighted RV coefficient between both configurations is defined as [1]

<math>\mathrm{RV} = \mathrm{RV}_{AB} = \frac{\mathrm{CV}_{AB}}{\sqrt{\mathrm{CV}_{AA}\,\mathrm{CV}_{BB}}} \qquad \text{where} \qquad \mathrm{CV}_{AB} = \operatorname{trace}(\mathbf{K}_A \mathbf{K}_B) \qquad (2)</math>

and constitutes a straightforward generalization of the original, unweighted RV coefficient [2], providing the cosine similarity between the vectorized matrices <math>\mathbf{K}_A</math> and <math>\mathbf{K}_B</math> (figure 1). In particular, <math>\mathrm{RV}_{AB} \ge 0</math> (since <math>\mathbf{K}_A</math> and <math>\mathbf{K}_B</math> are positive semi-definite: this condition is equivalent to the squared Euclidean nature of <math>\mathbf{D}_A</math> and <math>\mathbf{D}_B</math>), <math>\mathrm{RV}_{AB} \le 1</math> (by the Cauchy-Schwarz inequality) and <math>\mathrm{RV}_{AA} = 1</math>.

Figure 1: The weighted RV coefficient measures the similarity between two weighted configurations <math>(\mathbf{f}, \mathbf{D}_A)</math> (left) and <math>(\mathbf{f}, \mathbf{D}_B)</math> (right) embedded in <math>\mathbb{R}^{n-1}</math>. Here the 𝑛 objects (characters) are endowed with the same weights f in both configurations, and the object coordinates are obtained from weighted MDS applied to the squared Euclidean dissimilarities <math>\mathbf{D}_A</math>, respectively <math>\mathbf{D}_B</math>.

The null distribution of the weighted RV coefficient, i.e. its distribution assuming no relationship between the two configurations, and in particular its statistical significance, have been extensively investigated over the last decades [see e.g. 6, 1, and references therein].
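Definition (2) with common weights f, as in figure 1, can be sketched as follows. This is a hedged illustration on random synthetic configurations, not the case-study data; the helper names `sq_dists`, `kernel` and `rv_coefficient` are hypothetical.

```python
import numpy as np

def sq_dists(X):
    """Squared Euclidean distances between the rows of X."""
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

def kernel(f, D):
    """Kernel K = sqrt(Pi) B sqrt(Pi) of the weighted configuration (f, D), as in (1)."""
    n = len(f)
    H = np.eye(n) - np.outer(np.ones(n), f)
    B = -0.5 * H @ D @ H.T
    r = np.sqrt(f)
    return r[:, None] * B * r[None, :]

def rv_coefficient(K_A, K_B):
    """Weighted RV coefficient (2): cosine similarity between vectorized kernels."""
    cv_ab = np.trace(K_A @ K_B)
    return cv_ab / np.sqrt(np.trace(K_A @ K_A) * np.trace(K_B @ K_B))

rng = np.random.default_rng(0)
n = 6
f = rng.dirichlet(np.ones(n))                  # common weights, summing to one
D_A = sq_dists(rng.normal(size=(n, 3)))        # synthetic configuration A
D_B = sq_dists(rng.normal(size=(n, 3)))        # synthetic configuration B

K_A, K_B = kernel(f, D_A), kernel(f, D_B)
rv_ab = rv_coefficient(K_A, K_B)               # lies in [0, 1]
rv_aa = rv_coefficient(K_A, K_A)               # equals 1
rv_scaled = rv_coefficient(K_A, kernel(f, 4.0 * D_A))   # invariant under rescaling of D
```

Because the coefficient is a cosine between kernels, multiplying all dissimilarities of one configuration by a positive constant leaves it unchanged, which `rv_scaled` illustrates.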
The crucial issue here is that the character weights <math>\mathbf{f}_A</math> and <math>\mathbf{f}_B</math> differ in the two character networks, whence the name generalized weighted RV coefficient for the same quantity (2), where

<math>\mathbf{K}_A = \sqrt{\Pi_A}\,\mathbf{B}_A\,\sqrt{\Pi_A} \qquad \Pi_A = \mathrm{diag}(\mathbf{f}_A) \qquad \mathbf{B}_A = -\tfrac{1}{2}\,\mathbf{H}_A \mathbf{D}_A \mathbf{H}_A^\top \qquad \mathbf{H}_A = \mathbf{I}_n - \mathbf{1}_n \mathbf{f}_A^\top</math>

<math>\mathbf{K}_B = \sqrt{\Pi_B}\,\mathbf{B}_B\,\sqrt{\Pi_B} \qquad \Pi_B = \mathrm{diag}(\mathbf{f}_B) \qquad \mathbf{B}_B = -\tfrac{1}{2}\,\mathbf{H}_B \mathbf{D}_B \mathbf{H}_B^\top \qquad \mathbf{H}_B = \mathbf{I}_n - \mathbf{1}_n \mathbf{f}_B^\top</math>

This circumstance raises new and challenging issues, whose investigation was one of the motivations for embarking on the present piece of research. Note that well-established statistical procedures for testing the statistical significance of the weighted coefficient <math>\mathrm{RV}_h</math> are available [see e.g. 6, 1, and references therein]. However, testing the generalized weighted RV coefficient is, at the present time, a completely open issue.

====3.1. Decomposition of the generalized weighted RV coefficient====
As mentioned, the relative weights <math>\mathbf{f}_A</math> and <math>\mathbf{f}_B</math> (set to the uniform weights 1/𝑛 in most applications of Multivariate Analysis) may differ to a spectacular extent: their supports <math>\mathrm{supp}(\mathbf{f}_A)</math> and <math>\mathrm{supp}(\mathbf{f}_B)</math> do not even coincide in general, since version 𝐴 may contain characters absent from version 𝐵, and vice versa.

Define the compromise weight h as

<math>h_i = \frac{\sqrt{f_i^A f_i^B}}{Z} \qquad \text{where} \qquad Z = \sum_j \sqrt{f_j^A f_j^B} \in [0, 1] \qquad (3)</math>

This choice, initially dictated by formal considerations (it makes it possible to transform the square roots in (1) into a tractable expression), turns out to be conceptually convenient and interpretable as well: 𝑍 is a measure of similarity between the weights, appearing in identity (6) below. Also, <math>h_i = 0</math> unless character 𝑖 appears in both versions (figure 3). A little algebra shows that the numerator of the similarity index (2) can be expressed as

<math>\frac{\mathrm{CV}_{AB}}{Z^2} = \operatorname{trace}(\mathbf{K}_{hA}\mathbf{K}_{hB}) + \kappa_{AB} \qquad (4)</math>

where <math>\mathbf{K}_{hA}</math> is the kernel associated with configuration <math>(\mathbf{h}, \mathbf{D}_A)</math> and <math>\mathbf{K}_{hB}</math> is the kernel associated with configuration <math>(\mathbf{h}, \mathbf{D}_B)</math>.
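The compromise weights (3) and the step leading to (4) can be illustrated numerically. The sketch below uses synthetic weights and configurations (invented for illustration), with hypothetical helper names. It checks that, since the square root of the product of the two weights equals 𝑍 times the compromise weight, the numerator of (2) equals 𝑍² times a sum over pairs weighted by h, and it obtains the correction term of (4) simply as a residual, which vanishes when both weight vectors coincide.

```python
import numpy as np

def sq_dists(X):
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

def centered_products(f, D):
    """B = -1/2 H D H^T with H = I - 1 f^T."""
    n = len(f)
    H = np.eye(n) - np.outer(np.ones(n), f)
    return -0.5 * H @ D @ H.T

def kernel(f, D):
    r = np.sqrt(f)
    return r[:, None] * centered_products(f, D) * r[None, :]

def compromise_weights(f_A, f_B):
    """Compromise weights h and normalization Z of equation (3)."""
    g = np.sqrt(f_A * f_B)
    Z = g.sum()
    return g / Z, Z

def residual_kappa(f_A, f_B, D_A, D_B):
    """Correction term of (4), computed as CV_AB / Z^2 - trace(K_hA K_hB)."""
    h, Z = compromise_weights(f_A, f_B)
    cv_ab = np.trace(kernel(f_A, D_A) @ kernel(f_B, D_B))
    return cv_ab / Z**2 - np.trace(kernel(h, D_A) @ kernel(h, D_B))

rng = np.random.default_rng(2)
n = 5
D_A = sq_dists(rng.normal(size=(n, 3)))
D_B = sq_dists(rng.normal(size=(n, 3)))

# Character 4 is absent from version A, character 0 is absent from version B.
f_A = np.array([0.1, 0.4, 0.3, 0.2, 0.0])
f_B = np.array([0.0, 0.3, 0.3, 0.1, 0.3])
h, Z = compromise_weights(f_A, f_B)

# sqrt(f_i^A f_i^B) = Z h_i turns trace(K_A K_B) into a sum weighted by h.
cv_ab = np.trace(kernel(f_A, D_A) @ kernel(f_B, D_B))
B_A, B_B = centered_products(f_A, D_A), centered_products(f_B, D_B)
cv_ab_via_h = Z**2 * np.sum(np.outer(h, h) * B_A * B_B)

kappa_diff = residual_kappa(f_A, f_B, D_A, D_B)   # generally non-zero
kappa_same = residual_kappa(f_A, f_A, D_A, D_B)   # zero: h = f_A and Z = 1
```

Note that `h` vanishes for the two characters missing from one version, so only the common characters contribute to the kernels built on h.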
Also,

<math>\kappa_{AB} = D^A_{hf_A}\,D^B_{hf_B} + 2\sum_i h_i\,\ell_i^A\,\ell_i^B \qquad (5)</math>

where <math>D^A_{hf_A}</math> is the squared Euclidean distance between the gravity centers of <math>(\mathbf{h}, \mathbf{D}_A)</math> and <math>(\mathbf{f}_A, \mathbf{D}_A)</math>. The quantity <math>D^B_{hf_B}</math> is defined analogously. Naturally, the gravity centers of <math>(\mathbf{f}_A, \mathbf{D}_A)</math> and <math>(\mathbf{h}, \mathbf{D}_A)</math> generally differ, as do the gravity centers of <math>(\mathbf{f}_B, \mathbf{D}_B)</math> and <math>(\mathbf{h}, \mathbf{D}_B)</math>, but the differences are extremely small in the case study (see figure 6). The second component in (5) involves a weighted covariance between the h-centered vectors <math>\boldsymbol{\ell}^A = \mathbf{B}_{hA}\mathbf{f}_A</math> and <math>\boldsymbol{\ell}^B = \mathbf{B}_{hB}\mathbf{f}_B</math>, where

<math>\mathbf{B}_{hA} = -\tfrac{1}{2}\,\mathbf{H}_h \mathbf{D}_A \mathbf{H}_h^\top \qquad \mathbf{B}_{hB} = -\tfrac{1}{2}\,\mathbf{H}_h \mathbf{D}_B \mathbf{H}_h^\top</math>

This second component is also zero if the compromise centroid coincides with the original centroid in configuration 𝐴, or 𝐵, or both. In short, the term <math>\kappa_{AB}</math> in (5), which can be negative (as here for both distance variants), represents a correction due to the non-coincidence of the <math>\mathbf{f}_A</math>- and h-centroids in configuration 𝐴 (respectively the <math>\mathbf{f}_B</math>- and h-centroids in configuration 𝐵).

====3.2. An exact additive decomposition formula====
A similarity coefficient such as the generalized weighted coefficient RV ∈ [0, 1] can be simply converted into a dissimilarity coefficient 𝑑 ∈ [0, ∞) by 𝑑 = − ln RV. Applying this transformation to (2), taking into account the previous definitions and performing direct, down-to-earth algebraic operations finally yields the following exact decomposition of the dissimilarity between character networks 𝐴 and 𝐵:

<math>d_{AB} = \underbrace{-\ln \mathrm{RV}}_{\text{composite dissimilarity}} = \underbrace{-\ln \mathrm{RV}_h}_{\text{adjusted dissimilarity } d^h_{AB}} \; \underbrace{-\,2\ln Z}_{\text{dissimilarity between character weights}} \; \underbrace{-\,\tfrac{1}{2}\ln \Gamma_A}_{\text{relative dispersion, book}} \; \underbrace{-\,\tfrac{1}{2}\ln \Gamma_B}_{\text{relative dispersion, movie}} \; \underbrace{-\,\ln(1+\epsilon)}_{\text{centroid correction}} \qquad (6)</math>

where

∙ the "compromise" RV coefficient <math>\mathrm{RV}_h</math>, defined as <math>\mathrm{RV}_h = \frac{\operatorname{trace}(\mathbf{K}_{hA}\mathbf{K}_{hB})}{\sqrt{\operatorname{trace}(\mathbf{K}_{hA}^2)\,\operatorname{trace}(\mathbf{K}_{hB}^2)}} \in [0, 1] \qquad (7)</math>, measures the similarity between the dissimilarities <math>\mathbf{D}_A</math> and <math>\mathbf{D}_B</math> in the common compromise weighting h.
∙ 𝑍 ∈ [0, 1] in (3) is a measure of similarity between the weights <math>\mathbf{f}_A</math> and <math>\mathbf{f}_B</math>, taking on its maximum value 𝑍 = 1 iff <math>\mathbf{f}_A = \mathbf{f}_B</math>, and its minimum value 𝑍 = 0 iff the two versions have no character in common.
∙ <math>\Gamma_A = \frac{\operatorname{trace}(\mathbf{K}_{hA}^2)}{\operatorname{trace}(\mathbf{K}_A^2)}</math> is the ratio of the (quartic) dispersion of <math>\mathbf{D}_A</math> in the compromise weighting h to the dispersion of <math>\mathbf{D}_A</math> in the original weighting <math>\mathbf{f}_A</math> (<math>\Gamma_B</math> is defined analogously). The condition <math>-\tfrac{1}{2}\ln \Gamma_A > 0</math> essentially means that the average contrast between characters (as expressed by <math>\mathbf{D}_A</math>) is stronger in the original version <math>\mathbf{f}_A</math> than in the compromise version h, which is in particular likely to occur when "eccentric" characters in version 𝐴 occur less often in version 𝐵.
∙ the quantity <math>\epsilon = \frac{\kappa_{AB}}{\mathrm{RV}_h \sqrt{\Gamma_A \Gamma_B \operatorname{trace}(\mathbf{K}_A^2)\,\operatorname{trace}(\mathbf{K}_B^2)}}</math> is a normalized measure of the centroid correction occurring in (4). It reflects a "polarization effect" due to the centroid changes <math>\bar{\mathbf{x}}_{f_A} \to \bar{\mathbf{x}}_h</math> and <math>\bar{\mathbf{x}}_{f_B} \to \bar{\mathbf{x}}_h</math>, since the overall dispersions of <math>\mathbf{D}_A</math> and <math>\mathbf{D}_B</math> are bound to vary when the reference point is moved away from the centroid of the configuration. Its magnitude is expected to be small, since the main common characters (i.e. those with large compromise weights h) are precisely the most frequent in both versions 𝐴 and 𝐵.

===4. The case study===
The Lion, the Witch and the Wardrobe was the second of the seven novels of The Chronicles of Narnia, written by C. S. Lewis in 1950, and adapted into a film directed by A. Adamson and released in 2005 (figure 2). All named entities throughout the book and the movie script were annotated semi-manually with the module charnetto [7], then gathered into groups of aliases. A list of 37 distinct characters was identified:

∙ 16 characters are common to the book and the movie
∙ 8 characters occur in the book only
∙ 13 characters occur in the movie only.
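The three groups of characters listed above mean that, on a common index of 37 characters, the two weight vectors have different supports. As a sanity check of identity (4) and of the exact decomposition (6) in this setting, the sketch below uses purely synthetic weights and configurations that only reproduce the 16 / 8 / 13 support structure. The numbers it produces are unrelated to the case-study results, and all variable names are illustrative assumptions.

```python
import numpy as np

def sq_dists(X):
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

def centered_products(f, D):
    n = len(f)
    H = np.eye(n) - np.outer(np.ones(n), f)
    return -0.5 * H @ D @ H.T

def kernel(f, D):
    r = np.sqrt(f)
    return r[:, None] * centered_products(f, D) * r[None, :]

def tr(M, N):
    return np.trace(M @ N)

rng = np.random.default_rng(1)
n = 16 + 8 + 13                                   # common, book only, movie only
in_A = np.r_[np.ones(24, bool), np.zeros(13, bool)]
in_B = np.r_[np.ones(16, bool), np.zeros(8, bool), np.ones(13, bool)]

def random_weights(mask):
    f = np.zeros(n)
    f[mask] = rng.dirichlet(np.ones(mask.sum()))  # zero weight outside the support
    return f

f_A, f_B = random_weights(in_A), random_weights(in_B)
# Dissimilarities involving absent characters are irrelevant (zero weight).
D_A = sq_dists(rng.normal(size=(n, 4)))
D_B = sq_dists(rng.normal(size=(n, 4)))

g = np.sqrt(f_A * f_B)
Z = g.sum()
h = g / Z                                         # compromise weights (3)

K_A, K_B = kernel(f_A, D_A), kernel(f_B, D_B)
K_hA, K_hB = kernel(h, D_A), kernel(h, D_B)
l_A = centered_products(h, D_A) @ f_A             # vector l^A of (5)
l_B = centered_products(h, D_B) @ f_B             # vector l^B of (5)
kappa = (f_A @ l_A) * (f_B @ l_B) + 2.0 * np.sum(h * l_A * l_B)   # (5)

RV = tr(K_A, K_B) / np.sqrt(tr(K_A, K_A) * tr(K_B, K_B))          # (2)
RV_h = tr(K_hA, K_hB) / np.sqrt(tr(K_hA, K_hA) * tr(K_hB, K_hB))  # (7)
Gamma_A = tr(K_hA, K_hA) / tr(K_A, K_A)
Gamma_B = tr(K_hB, K_hB) / tr(K_B, K_B)
eps = kappa / (RV_h * np.sqrt(Gamma_A * Gamma_B * tr(K_A, K_A) * tr(K_B, K_B)))

d_AB = -np.log(RV)
terms = [-np.log(RV_h), -2.0 * np.log(Z), -0.5 * np.log(Gamma_A),
         -0.5 * np.log(Gamma_B), -np.log1p(eps)]  # the five components of (6)
```

The five entries of `terms` should sum to `d_AB` up to rounding. The squared distances between centroids appearing in (5) are computed here as the quadratic forms `f_A @ l_A` and `f_B @ l_B`, which is one convenient way of evaluating them.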
Figure 2: The two works under study: the book (A) and the movie (B)

Figure 3: Book character weights <math>\mathbf{f}_A</math> (left), movie character weights <math>\mathbf{f}_B</math> (middle) and compromise character weights h (right), with <math>h_i = \sqrt{f_i^A f_i^B}/Z</math>. Characters appearing in both versions are represented by grey bars, and otherwise by white bars.

For each work, we defined the edge weights as the cross-count matrix <math>c_{ij}</math> = "number of co-occurrences of characters 𝑖 and 𝑗 within a window of 5 paragraphs" (each paragraph being delimited by a line break), with <math>c_{ii} = 0</math> (see figure 5). Similarly, the character weights were, for a given work, simply defined as <math>f_i = c_{i\bullet}/c_{\bullet\bullet}</math>. Figure 4 depicts the corresponding networks.

The cross-count matrices C make it possible to compute commute-time distances (section 2.1.1) and diffusive distances (section 2.1.2). Weighted MDS (section 2.2) in turn yields character coordinates, as depicted in figure 6. In the present study, the centroids of configurations <math>(\mathbf{f}_A, \mathbf{D}_A)</math> and <math>(\mathbf{f}_B, \mathbf{D}_B)</math> are located at the origin by construction, while the first coordinates of the centroids of <math>(\mathbf{h}, \mathbf{D}_A)</math> and <math>(\mathbf{h}, \mathbf{D}_B)</math> are <math>(\bar{x}^A_h, \bar{y}^A_h) = (0.002, 0.004)</math>, respectively <math>(\bar{x}^B_h, \bar{y}^B_h) = (-0.0007, -0.005)</math>, and thus fairly close to the origin: <math>D^A_{hf_A} = 5.6 \cdot 10^{-5}</math>, respectively <math>D^B_{hf_B} = 4.0 \cdot 10^{-5}</math>. As a consequence, the terms <math>\kappa_{AB}</math> in (5) and <math>\epsilon</math> in (6) are small.

Figure 4: Character networks of the book (left) and movie (right). Edge widths reflect the co-occurrences <math>c_{ij}</math> between nodes, and name sizes the corresponding degree <math>c_{i\bullet}</math>.
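The construction of the edge and node weights described in this section can be sketched as follows. The exact windowing convention is not specified in the text, so this sketch makes an explicit assumption: windows of consecutive paragraphs slide one paragraph at a time, and each pair of distinct characters present in a window is counted once per window. The function name `cross_counts`, the toy paragraphs and the window length of 2 used in the toy call are all invented for illustration (the study itself uses a window of 5 paragraphs).

```python
from itertools import combinations

import numpy as np

def cross_counts(paragraphs, characters, window=5):
    """Symmetric cross-count matrix with zero diagonal.

    paragraphs: list of sets, each holding the characters named in one paragraph.
    Assumption: sliding windows of `window` consecutive paragraphs, each pair of
    distinct characters present in a window being counted once for that window.
    """
    index = {name: k for k, name in enumerate(characters)}
    C = np.zeros((len(characters), len(characters)))
    n_windows = max(len(paragraphs) - window + 1, 1)
    for start in range(n_windows):
        present = set().union(*paragraphs[start:start + window])
        for a, b in combinations(sorted(present), 2):
            i, j = index[a], index[b]
            C[i, j] += 1
            C[j, i] += 1
    return C

# Toy input: characters already annotated per paragraph (not the real corpus).
characters = ["Lucy", "Edmund", "Mr. Tumnus", "White Witch"]
paragraphs = [
    {"Lucy"},
    {"Lucy", "Mr. Tumnus"},
    set(),
    {"Edmund"},
    {"Edmund", "White Witch"},
    {"Lucy", "Edmund"},
]

C = cross_counts(paragraphs, characters, window=2)
f = C.sum(axis=1) / C.sum()        # character weights f_i = c_i. / c_..
```

With this convention a pair that stays together over several overlapping windows is counted several times; counting over non-overlapping windows, or counting mentions rather than presence, would be equally defensible readings and would give different matrices.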
Figure 5: Symmetric cross-count matrix <math>\mathbf{C} = (c_{ij})</math> for the book (column categories are identical to row categories)

The generalized coefficient RV defined in (2) (with differing weights) and the compromise coefficient <math>\mathrm{RV}_h</math> defined in (7) turn out to be

RV = 0.113 and <math>\mathrm{RV}_h</math> = 0.391 (diffusive distance)
RV = 0.531 and <math>\mathrm{RV}_h</math> = 0.611 (commute-time distance)

In both cases, the magnitude of the term <math>\kappa_{AB}</math> in (4) is negligible in comparison to <math>\operatorname{trace}(\mathbf{K}_{hA}\mathbf{K}_{hB})</math>. Also, (6) reads here (in order)

2.1829 = 0.9385 + 0.2323 + 0.4349 + 0.5762 + 0.0011

for the diffusive distances, and

0.6333 = 0.4917 + 0.2323 − 0.3431 + 0.2437 + 0.0087

for the commute-time distances.

Figure 6: First MDS coordinates of the characters of the book (left) and movie (right); axes: factor 1 (explained inertia 38.4%, respectively 38.8%) and factor 2 (explained inertia 27.0%, respectively 25.6%). They have been extracted from the inter-character diffusive distances <math>\mathbf{D}^{\mathrm{diff}}_A(t)</math> for the book (left), respectively <math>\mathbf{D}^{\mathrm{diff}}_B(t)</math> for the movie (right), with a diffusion time arbitrarily set to 𝑡 = 10. Characters in black appear in both works, characters in green in one work only. The blue point depicts the corresponding centroids obtained with the compromise distribution h (section 3.1).

===5. Conclusion===
Representing the relations between characters of a work as a weighted Euclidean configuration (f, D) arguably constitutes an instance of very distant reading, but not more distant than the usual representation by a weighted network. In both cases, the underlying dyadic formalism (i.e. based upon character pairs) could, and maybe should, be extended to a 𝑝-adic formalism, taking into account the simultaneous co-occurrences of 𝑝 = 0, 1, 2, 3, . . . characters (cliques). Also, the simple co-occurrence relation is in itself particularly rudimentary, yet surprisingly efficient, as attested in many applications of Data Analysis, Natural Language Processing and Machine Learning.

On the one hand, we recognize that the mathematical requirements needed to appreciate (or not) the present proposal may distress some enthusiasts of character networks. Also, a fully convincing literary interpretation of the various terms in decomposition (6) is yet to be established. Furthermore, obtaining a single index (such as RV = 0.113) is neither terribly enlightening nor helpful. Comparing more than two character networks is more satisfactory, but multiple versions of character networks are alas rare.

On the other hand, quantifying the dissimilarity between two networks cannot ignore mathematical issues, and the proposed formalism yields a procedure which can be made fully automatic, and which produces dissimilarities that can be shown to be metric, namely such that <math>d_{AB} \le d_{AC} + d_{CB}</math> (triangle inequality) for three versions 𝐴, 𝐵 and 𝐶. Also, the exact decomposition permits a detailed, systematically comparable analysis of the sources of (dis)similarity between two character networks.
More generally, the present formalism may contribute to better anchoring the study of character networks in mainstream Data Analysis, and draw attention to otherwise overlooked phenomena. For instance, the weights similarity index 𝑍 can be related to the Chernoff information occurring in the Neyman-Pearson statistical testing framework [see e.g. 8]. Dealing with distinct distributions on the same objects endowed with pair distances evokes Optimal Transportation theory, together with the possible involvement of the earth mover's distance [see e.g. 9, 10] in comparing two character networks. Finally, the quantities <math>\Gamma_A</math> and <math>\Gamma_B</math>, which can be greater or less than one, should be interpreted as indicators of the diversity loss entailed by the disappearance of 𝐴-specific characters in the compromise weighting, or, on the contrary, as a diversity gain reflecting distinct emphasis and specificity between two variants 𝐴 and 𝐵. But, as demonstrated in the case study, such behaviour turns out to depend on the choice of the character dissimilarities D, whose suitability from a more literary perspective should certainly be further investigated in future developments.

===References===
[1] F. Bavaud, Exact first moments of the RV coefficient by invariant orthogonal integration, Journal of Multivariate Analysis (2023) 105227.
[2] P. Robert, Y. Escoufier, A unifying tool for linear multivariate statistical methods: the RV-coefficient, Journal of the Royal Statistical Society: Series C (Applied Statistics) 25 (1976) 257–265.
[3] J. G. Kemeny, J. L. Snell, Finite Markov chains: with a new appendix "Generalization of a fundamental matrix", Springer, 1983.
[4] M. Saerens, F. Fouss, L. Yen, P. Dupont, The principal components analysis of a graph, and its relationships to spectral clustering, in: European conference on machine learning, Springer, 2004, pp. 371–383.
[5] F. Bavaud, Spatial weights: Constructing weight-compatible exchange matrices from proximity matrices, in: M. Duckham, E. Pebesma, K. Stewart, A. U. Frank (Eds.), Geographic Information Science, Springer International Publishing, Cham, 2014, pp. 81–96.
[6] J. Josse, J. Pagès, F. Husson, Testing the significance of the RV coefficient, Computational Statistics & Data Analysis 53 (2008) 82–91.
[7] C. Métrailler, charnetto: a module designed to create an automated character network based on a book or a movie script, 2021. https://pypi.org/project/charnetto/.
[8] T. M. Cover, Elements of information theory, John Wiley & Sons, 1999.
[9] C. Villani, Optimal transport: old and new, volume 338, Springer, 2009.
[10] M. Cuturi, D. Avis, Ground metric learning, The Journal of Machine Learning Research 15 (2014) 533–564.