A Framework for Embedding Entities in a Textual Narrative: a Case Study on Les Misérables

Guillaume Guex
Faculty of Arts, Department of Language and Information Sciences, University of Lausanne, bâtiment Anthropole,
1015 Lausanne, Switzerland


Abstract
In this article, we propose a general and flexible framework for studying the narrative entities found in a
literary work. This framework is presented starting from a broad perspective, namely how to segment
the work into textual units and organize the resulting data, and is narrowed down to a particular case: the
study of characters and relationships found in Les Misérables. A notable choice was made in the current
instance of the framework: the construction of embeddings containing both textual units and narrative
entities alongside words. These embeddings, where different spatial regions can be interpreted with
word vectors, are the key to characterizing the studied entities. Four types of embedding methods
are constructed, and their results on Les Misérables show the potential of this framework for
analyzing characters and relationships in a narrative.

Keywords
Digital Humanities, Distant Reading, Textual Narrative, Narrative Entity, Embeddings, Characters




1. Introduction

In the field of Digital Humanities, Distant Reading tools [1] allow researchers to quickly gain
knowledge about textual corpora without actually reading them. The purposes of these methods
vary, but they can be broadly categorized into two groups. In the first case, these methods are
used to tag, classify, or summarize large quantities of documents, in order to quickly structure
information or to make statements about the whole studied corpus [2]. Methods in this case rely
heavily on Big Data and make extensive use of Machine Learning, often with the help of
supervised methods. In the second case, researchers use computational methods to reveal
hidden structures in a small corpus or even a single document, which helps them to refine their
understanding of this corpus or to validate hypotheses [3]. Methods in this setting can also rely
on Machine Learning, but must typically be built with more caution and attention to detail:
corpora are smaller, analyses are closer to the work, and methods must be transparent in order
to interpret results appropriately. The use of exploratory tools and unsupervised methods is
also preferred in this context, as it is less desirable to base methods on information coming from
large external corpora. The method proposed in this article belongs to the second
group, as it is unsupervised and can be applied to a single document.


COMHUM 2022: Workshop on Computational Methods in the Humanities, June 9–10, 2022, Lausanne, Switzerland
Email: guillaume.guex@unil.ch (G. Guex)
ORCID: 0000-0003-1001-9525 (G. Guex)
                                                                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).









   When a single literary work (or a few) is analyzed, a common practice is to study the narrative
entities (characters, events, locations, etc.) used by the author in her/his book [4]. Researchers
are frequently interested in depicting them and in seeing how they interact with each other
in the story. Various computational tools can help them in this task, to name a few: Named
Entity Recognition tools [5, 6, 7], Automatic Character Network Extraction [8], Sentiment
Analysis and Topic Modeling [9], Textometry [10], and Word Embeddings [11, 12, 13]. All these
methods have been used to explicitly show hidden structures constructed by the author
in her/his work. They permit finding patterns, and can help to categorize particular narrative
constructions, writing styles, or genres. These kinds of methods can be a great complement to
classical analyses of literary works, as they allow researchers to efficiently summarize information
which is otherwise quite diffuse.
   In this article, we propose a general framework for automatically characterizing various
narrative entities in a literary work. The entire framework is presented starting from a wide
perspective, namely how to organize the textual data, and is narrowed down to a specific use,
the study of character relationships in Les Misérables, by Victor Hugo. Throughout this
presentation, various choices are made to highlight a particular use of this framework, but these
choices should be viewed as suggestions rather than rules: the real strength of this framework is its
flexibility, and the direction taken in this article is oriented toward a defined task. To be more
specific, we will show how to use embeddings to locate characters and their relationships
alongside the vocabulary. An association measure can then be constructed between these words
and entities, which can help a practitioner to depict them. Four variations of this method are
proposed and tested on Les Misérables.
   The idea behind this framework comes from the field of automatic extraction and analysis of
character networks from literary works (see [8] for a survey). When building character networks
from a textual narrative, one of the most widespread methods consists in dividing the studied
work into 𝑛 narrative units or contexts 𝑢1 , . . . , 𝑢𝑛 , which can be, e.g., sentences, paragraphs, or
chapters, and then counting the number of units where characters co-occur [9, 14, 15, 16, 17].
Usually, the text constituting these units is discarded and the resulting network displays edges
which roughly represent an aggregated number of interactions between characters. However,
this aggregation mixes various kinds of interactions and gives little information about the
type of relationship which exists between characters. Various improvements were proposed in
order to weight [18] or sign [19] (or both [9]) the edges of the character networks. A particular
inspiration for the current work is the article by Min and Park (2019) [9], where the authors also
analyzed characters in Les Misérables by building various signed and weighted networks, with
the help of Sentiment Analysis and Topic Modeling. The current framework was built in order
to generalize this idea of refining character relationships, by formalizing the data structure and
keeping directions of exploration as wide as possible. Embeddings [20] appeared to us to be the
proper tool for achieving this. Indeed, with embeddings, the textual contents of
units are transformed into workable mathematical objects (vectors), usable for various tasks,
while conserving a maximum of information. The framework has been further generalized in
order to be applicable to different sorts of narrative entities, but the presented case remains the
study of character relationships in Les Misérables.
   The current article is structured as follows. Section 2 defines the framework, with section 2.1
defining the data organization, section 2.2 describing how to embed textual units, and section





2.3 deriving entity vectors lying in the same space as units. In section 3, we present the specific
methodology and results for the case study of character relationships in Les Misérables, and
section 4 draws conclusions and perspectives about this work. All (Python) scripts and datasets
used in this article, as well as extended results, can be found in the dedicated GitHub repository.1


2. Framework
2.1. Data organization
In this article, a textual narrative is divided into 𝑛 textual units 𝑢1 , . . . , 𝑢𝑛 , and is represented
through two tables. The first one is well known in the field of textual analysis and consists of
the (𝑛 × 𝑣) unit-word contingency table N, as represented by Table 1, where 𝑣 is the vocabulary
size. In this table, each row represents a unit, each column a word, and cell 𝑛𝑖𝑗 counts the
number of times word 𝑗 appears in unit 𝑖. Using this table typically denotes a Bag-of-Words
approach in our analyses.
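   To make this step concrete, the table N can be built with standard tools. The following sketch uses scikit-learn's CountVectorizer on toy data; the variable names and contents are illustrative, not the paper's actual pipeline (whose scripts are in its GitHub repository).

# A minimal sketch of building the unit-word contingency table N,
# assuming `units` is a list of already-preprocessed textual units.
from sklearn.feature_extraction.text import CountVectorizer

units = [
    "aller allumer apercevoir bas bon bon",   # u_1 (toy data)
    "aller bas bon",                          # u_2
]

vectorizer = CountVectorizer()
N = vectorizer.fit_transform(units)            # sparse (n x v) count matrix
vocabulary = vectorizer.get_feature_names_out()
print(N.toarray())   # rows: units, columns: words, cell n_ij: count of word j in unit i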

Table 1
A snippet of the unit-word contingency table N extracted from Les Misérables. Rows are chapters,
columns are words in the vocabulary, and cell 𝑛𝑖𝑗 counts the number of times word 𝑗 appears in chapter 𝑖.
                                        aller     allumer   apercevoir   bas   bon   ...
                                𝑢101     23         2            6       11     6    ...
                                𝑢102     12         1            0       3      9    ...
                                𝑢103     10         0            5       1      5    ...
                                𝑢104     0          0            1       0      0    ...


   The second table is the unit-entity table, denoted E. It has a size of (𝑛 × 𝑝), where 𝑝 is the
number of narrative entities found in the text, and cell 𝑒𝑖𝑗 indicates the presence, or the count
in a weighted version, of entity 𝑗 in unit 𝑖. A narrative entity, in the context of this article, is
loosely defined in order to remain flexible for various types of texts or analyses. It can roughly
be seen as a recurring object with some importance in the narration. For example, it can be
a location, an object, a character, a pair of characters (or even a triplet, a quadruplet, etc.), an
oriented character interaction (e.g. a dialog), or even a particular recurring event containing
multiple characters (e.g. a meeting). In this article, we mostly consider characters and pairs
of characters as entities, as shown in Table 2. Note that in the present case, we consider that
a character or a pair of characters is present in a unit if the character names (or aliases) are
detected above a fixed threshold. A weighted version of this table, where 𝑒𝑖𝑗 contains the number
of occurrences of entity 𝑗 in unit 𝑖, is also possible. However, the equations presented in
this article are written for the presence/absence version.
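   As an illustration, the presence/absence version of E can be derived from per-unit character name counts; the threshold, names, and counts below are illustrative placeholders.

# A sketch of the unit-entity table E: single characters are marked present
# when their names (or aliases) are detected above a threshold, and a pair is
# present when both of its members are. All values here are illustrative.
import numpy as np
from itertools import combinations

characters = ["Cosette", "Thénardier", "Valjean"]
name_counts = np.array([[3, 2, 0],     # detected name counts in unit u_1
                        [2, 0, 5]])    # ... in unit u_2
threshold = 2

char_presence = (name_counts >= threshold).astype(int)
pairs = list(combinations(range(len(characters)), 2))
pair_presence = np.column_stack(
    [char_presence[:, a] * char_presence[:, b] for a, b in pairs]
)
E = np.hstack([char_presence, pair_presence])   # (n x p) unit-entity table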
   This data organization already gives an orientation to subsequent analyses and should be
kept in mind by the practitioner. Textual units are now considered as individuals (in the
statistical terminology), defined by the variables contained in the different columns of both
tables. Moreover, subsequent analyses are oriented toward searching for how the unit-entity table E
influences the unit-word table N, i.e. searching for which words are over-represented

1
    https://github.com/gguex/char2char_vectors.





Table 2
A snippet of the unit-entity table E extracted from Les Misérables. Rows are chapters, columns are
characters (left) and character pairs (right), and cell 𝑒𝑖𝑗 denotes whether entity 𝑗 appears in chapter 𝑖.
            Cosette   Thénardier   Valjean   ...          ...   Cosette-Thénardier   Cosette-Valjean   ...
     𝑢101      1          1          0       ...   𝑢101   ...           1                  0           ...
     𝑢102      1          1          0       ...   𝑢102   ...           1                  0           ...
     𝑢103      1          1          0       ...   𝑢103   ...           1                  0           ...
     𝑢104      1          0          1       ...   𝑢104   ...           0                  1           ...



or under-represented given the entities within a specific unit. While an author uses
characters in order to build her/his narrative, we, to a certain extent, work backward: we are
searching for how character appearances and interactions in the textual unit act on her/his choice
of words. If the extraction method permits it, a practitioner should include all entities which
she/he desires to study. Here, for example, the choice to include character pairs along with
characters is motivated by the fact that we are interested in studying character relationships. A
character pair can roughly be seen as an interaction between two characters, and this interaction
should be considered as an object of its own: the presence of this interaction in a unit does not
result in a mixture of the words used for each character, but rather gives a specific flavor to
the unit.
   This data organization also highlights the importance of choosing a proper size for the
units. These units should be large enough to contain enough words to properly
capture the textual specificity of each unit, but not too large, as each unit should ideally capture
particularities about one of the entities. Unfortunately, it is impossible to define an ideal size
for all types of analysis. This size should be balanced with regard to the level of analysis, the text
size, the selected entities, and previous knowledge of the studied work.
   The use of a contingency table N to represent the textual resources present in the units
denotes a Bag-of-Words approach. This approach loses the information relative to the
order of words in the units, but transforms a chain of characters, unsuited to statistical
analysis, into a contingency table, a well-studied mathematical object which allows the use of
various kinds of computational methods. The next section shows a particular direction on how
to use this table, with the help of embeddings.

2.2. Embedding of textual units

Various methods can be applied to the contingency table N in order to extract information
from it. Here, we choose to extract a lower-dimensional, numeric representation of
each unit, in other words, a textual unit vector located in an embedding space.
In section 2.3, these vectors of textual units are used as anchor points in order to also embed
entities into the same space. Therefore, it is crucial that the directions or regions of this
embedding space can be interpreted, in order to properly interpret the localization of
entity vectors (the relative position of entity vectors among themselves is generally insufficient).
For that reason, we focus on embeddings of textual units which also contain vectors of words:
by examining the positions of entities relative to word vectors, entities can be depicted.
We propose two embeddings verifying this condition: section 2.2.1 describes Correspondence





Analysis (CA) and section 2.2.2 focuses on Pre-trained Word Vectors (WV).

2.2.1. Correspondence Analysis (CA)

Using Correspondence Analysis (CA) to analyze textual resources has a long tradition
[21]. It has the advantage of naturally providing an embedding space, the factorial map, where
units are placed alongside word vectors, and it allows the placement of units to be interpreted
in terms of word frequency profiles. Unit and word vectors in the embedding space have a
direct interpretation in terms of the chi-square distance between profiles.
   By performing a Correspondence Analysis on table N, we get 𝑛 vectors x1 , . . . , x𝑛 corre-
sponding to units (rows) and 𝑣 vectors w1 , . . . , w𝑣 corresponding to words (columns). Each
of these vectors has a size of min(𝑛, 𝑣) − 1, which will generally be 𝑛 − 1. For a detailed
computation of the quantities in CA, see Appendix A.1.
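   As a rough sketch of this step (the coordinate scaling actually used by the paper is defined in Appendix A.1, which is not reproduced here), the textbook CA coordinates can be obtained from the SVD of the standardized residuals of N:

# A textbook CA sketch via SVD of standardized residuals; treat this as
# illustrative, since Appendix A.1 fixes the exact scaling. N is assumed to
# be a dense count array with no empty rows or columns.
import numpy as np

def correspondence_analysis(N):
    P = N / N.sum()                                       # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                   # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # standardized residuals
    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    X = (U * D) / np.sqrt(r)[:, None]                     # unit (row) coordinates
    W = (Vt.T * D) / np.sqrt(c)[:, None]                  # word (column) coordinates
    return X[:, :-1], W[:, :-1]                           # drop the null last dimension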
   An association score between a particular unit 𝑖 and a word 𝑗 is expressed through the scalar
product between their vectors

                                 $a_{ij} := \mathbf{x}_i^\top \mathbf{w}_j .$                                 (1)

A positive (resp. negative) association score denotes an over-representation (resp. under-
representation) of word 𝑗 in unit 𝑖, which permits finding lists of words characterizing the
different units. Note that in this article, this association score is rather computed between a
word vector and an entity vector, since the latter, as we will see in section 2.3, lies in the same
space as unit vectors. We could also track how units (or entities) are dissimilar to each other
by using, this time, the Euclidean distance between vectors.
   Note that the vectors x1 , . . . , x𝑛 obtained from CA reflect textual unit profiles (in terms of words)
relative to the mean profile (the origin of the factorial map). This analysis is thus contrastive:
it highlights unit variations within the studied text. It means that the particular tone of the whole
studied text might be hidden in this analysis, and only the variation around this tone will be
revealed. This might lead to situations where the (absolute) feeling experienced by the reader
does not appear in the analysis, e.g., a sad character in a sad book might appear joyful if he is less
sad than the mean tone. This can become problematic when the method is used sequentially to
study multiple works: the particularities of each book will be hidden. Another limitation of this
approach is that the words helping the interpretation of units (and entities) are contained in
the studied text. Approaches requiring the study of the position of units and entities relative to a
predefined list of words (e.g., friends, enemies, family) might therefore be impossible if these
words do not appear in the text.

2.2.2. Pre-trained Word Vectors (WV)

Pre-trained Word Vectors (WV), based on methods such as Word2Vec [22], GloVe [23], fastText
[24], or BERT [25], have received great attention from various fields in the last decade. They are
generally obtained through training on a very large corpus, such as Wikipedia or Common
Crawl, and the resulting embedding contains a large quantity of word vectors. As shown by
multiple studies (see [26] for a survey), these vectors are placed so as to reflect semantic
and syntactic relationships between words, and are used in various applications. We focus
here on static word embeddings, where word vectors are fixed and do not depend on their





context, obtained with, e.g., fastText. The reason is that we need interpretable regions in
an unchanging embedding space.
   There exist multiple methods which use pre-trained word vectors to derive vectors
for a group of words, such as sentences [17, 27], paragraphs [28], or documents [29]. These
derived vectors are often used to apply a classification or clustering algorithm to the newly
embedded objects, or to query information [27, 29]. To derive these vectors, the majority of
methods use the frequencies of words found in the objects, i.e. a table similar to N, but apply various
weighting schemes and normalizations in order to reduce the effects of frequent words and
to standardize vectors. In the present article, we use a methodology proposed in [27], as it is
compatible with multiple unit sizes and gives good results in many tasks. Thus, the textual unit
vectors x1 , . . . , x𝑛 are obtained from the table N with the method detailed in Appendix
A.2.
   An association score can again be computed between a unit (or an entity) vector x𝑖 and word
vector w𝑗 through the cosine similarity, defined by
                        $a_{ij} := \dfrac{\mathbf{x}_i^\top \mathbf{w}_j}{\sqrt{\mathbf{x}_i^\top \mathbf{x}_i \, \mathbf{w}_j^\top \mathbf{w}_j}}$                        (2)

Note that, with word vectors, this cosine similarity also permits to compare units (or entities)
between themselves.
   With the pre-trained word vector method, the unit vectors x1 , . . . , x𝑛 (and the entity vectors
of section 2.3) lie in an absolute space defined by the pre-trained word vectors. Comparisons
between different texts are therefore more pertinent, and associations with words absent from
the corpus can be made. However, it is possible that all units from a given text will be located
in the same region of the space if the vocabulary used in it is very specific. In this case, the
list of most associated word vectors might be similar for every unit, and the analysis will not
give satisfying results. This effect is fortunately limited by the centering of unit vectors which
occurs in the method described in Appendix A.2.
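   As a simplified sketch of this step (the exact weighting scheme of [27] is given in Appendix A.2 and not reproduced here), unit vectors can be obtained as frequency-weighted averages of pre-trained word vectors, followed by the centering mentioned above:

# A simplified sketch of WV unit vectors: frequency-weighted averages of
# pre-trained word vectors, then centering. The paper applies the more
# refined weighting of [27]; this only shows the general shape.
import numpy as np

def wv_unit_vectors(N, word_vecs):
    """N: (n x v) count array; word_vecs: (v x d) pre-trained vectors."""
    freqs = N / N.sum(axis=1, keepdims=True)     # per-unit word frequencies
    X = freqs @ word_vecs                        # (n x d) raw unit vectors
    return X - X.mean(axis=0)                    # centering (cf. end of 2.2.2)

def cosine_association(x, w):
    """Association score of equation (2)."""
    return x @ w / (np.linalg.norm(x) * np.linalg.norm(w))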

2.3. Entity embeddings

The main goal of this article is not to analyze units, but rather entities, i.e., the 𝑝 columns of
table E. While we use the table N to build embeddings of units, we use the table E to
build the entity vectors y1 , . . . , y𝑝 relative to the unit vectors x1 , . . . , x𝑛 . Two methods are
proposed: the centroids method (CENT), described in section 2.3.1, and the regressions
method (REG), explained in section 2.3.2. Both methods can be combined with either of the unit
embeddings defined in the previous section.

2.3.1. Centroids (CENT)

This method is the most straightforward and is based on the following intuition: an entity is
characterized equally by all units in which it appears. In other words, we can define the vector
y𝑘 for entity 𝑘 as

                                 $\mathbf{y}_k = \sum_{i=1}^{n} f_i e_{ik} \mathbf{x}_i$                                 (3)





where $f_i = n_{i\bullet}/n_{\bullet\bullet}$ is the relative weight of unit 𝑖. y𝑘 indicates the center of mass, or centroid, of
the units containing the entity. This way of building entity vectors is closely related to the
treatment of supplementary variables found in CA: these variables do not act in the choice of
factorial axes, but can still be represented afterward. By contrast, however, entity vectors are
not dilated after computing centroids, which means that they lie in the same space as units
(rows).
   An important remark about the centroid method is that entity vector positions are additive,
i.e. we have

                  $e_{ik} = \sum_{g \in \mathcal{G}} e_{ig} \;\, \forall i \;\Longrightarrow\; \mathbf{y}_k = \sum_{g \in \mathcal{G}} \mathbf{y}_g ,$                  (4)

where 𝒢 is a subset of entities. This property can be interpreted as follows: if a character 𝑘
can be divided among different situations 𝑔 (the character alone, the character in interaction
with another character, etc.), the character vector y𝑘 is in fact the sum of all vectors y𝑔 of these
situations. This is not necessarily an undesirable property, but it implies that the specificities of
the lone character might be hidden if the character is often registered in an interaction. By contrast, if
we consider that an interaction between two characters is an emerging situation, unrelated to
the prior behaviors of the characters, the regressions method described in the next section seems
more appropriate.
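   Equation (3) amounts to a single weighted matrix product over all entities at once; a minimal sketch (variable names are illustrative):

# Centroid entity vectors (equation 3): y_k = sum_i f_i e_ik x_i for every
# entity k, computed in one matrix product.
import numpy as np

def centroid_embeddings(E, X, f):
    """E: (n x p) presence table; X: (n x d) unit vectors; f: (n,) unit weights."""
    return (E * f[:, None]).T @ X   # (p x d) entity vectors, one per row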

2.3.2. Regressions (REG)

When building a regression model with multiple explanatory variables, it is possible to also
include their interactions. By doing so, we suppose that the effect of raising both variables is not
the same as raising each variable independently. Regression models therefore seem appropriate
to capture the specificities of having a particular entity in a textual unit. For example, in the case of
character pairs, the presence of a character 𝑎 will have an effect on the vocabulary of a unit, the
presence of another character 𝑏 will have another effect, and the presence of the pair {𝑎, 𝑏} yet
a different effect. Now, the dependent variables of the regression models still need to be defined. In
fact, we run 𝑑 regressions, with 𝑑 the number of dimensions of the embedding, and the 𝛼-th
regression is constructed to predict the 𝛼-th coordinate of units by using the binary variables of
the table E. In matrix notation, all regression models can be written as

                                 $\mathbf{X} = \tilde{\mathbf{E}} \mathbf{B} + \boldsymbol{\Sigma} ,$                                 (5)

where X = (𝑥𝑖𝛼 ) is the (𝑛 × 𝑑) matrix containing unit vectors (on rows), Ẽ is the matrix E
with an additional first column of 1s for the intercept, B = (𝛽𝑘𝛼 ) is the ((𝑝 + 1) × 𝑑) matrix
containing intercepts and regression coefficients (each column corresponds to one regression),
and Σ is the (𝑛 × 𝑑) matrix containing normal errors.
   The intercept and coefficient estimates B̂ = (𝛽̂𝑘𝛼 ) can be considered as our embeddings for
the entities as well as for the intercept, which represents the general tone of the studied text. We
therefore denote these estimates by Y = (𝑦𝑘𝛼 ) in the following, with the notation convention
𝑦0𝛼 for intercept coordinate 𝛼.
   As the number of entities (i.e. predictors) might be very large, it is a good idea to add an 𝐿2
regularization term to the objective function. Moreover, the quadratic error should also be





weighted by the number of tokens in each unit. Including all this, we find the solution for our
intercept and entity vectors y0 , y1 , . . . , y𝑝 , contained in the rows of Y, with

          $\mathbf{Y} = \big(\tilde{\mathbf{E}}^\top \mathrm{Diag}(\mathbf{f}) \, \tilde{\mathbf{E}} + \lambda \mathbf{I}_{(p+1)}\big)^{-1} \tilde{\mathbf{E}}^\top \mathrm{Diag}(\mathbf{f}) \, \mathbf{X} ,$          (6)

where Diag(f ) is the diagonal matrix containing the unit weights f = (𝑓𝑖 ), 𝜆 > 0 is the
regularization coefficient, and I(𝑝+1) is the identity matrix of size ((𝑝 + 1) × (𝑝 + 1)).
   An interesting effect of the regularization coefficient is that if 𝜆 is high, equation (6) becomes
$\mathbf{Y} \approx \frac{1}{\lambda} \tilde{\mathbf{E}}^\top \mathrm{Diag}(\mathbf{f}) \mathbf{X}$, which is similar to equation (3) up to a contraction factor 𝜆. In fact, the
regressions method with a regularization term interpolates between the hypothesis that every
entity should be considered independently (with 𝜆 → 0) and the hypothesis of an additive
mixture between entities (with 𝜆 → ∞), as discussed in section 2.3.1. Choosing an appropriate
𝜆 according to the study (how to do so is another, difficult question) might lead to a situation
revealing desirable information about entities.
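   Equation (6) is a closed-form weighted ridge regression and can be computed directly; a minimal sketch (variable names are illustrative):

# Weighted ridge solution of equation (6). Row 0 of Y is the intercept
# (the general tone of the text); rows 1..p are the entity vectors.
import numpy as np

def regression_embeddings(E, X, f, lam=0.01):
    """E: (n x p) presence table; X: (n x d) unit vectors; f: (n,) unit weights."""
    n, p = E.shape
    E_tilde = np.hstack([np.ones((n, 1)), E])     # prepend intercept column
    EtW = E_tilde.T * f                           # E_tilde^T Diag(f)
    A = EtW @ E_tilde + lam * np.eye(p + 1)
    return np.linalg.solve(A, EtW @ X)            # ((p+1) x d) matrix Y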


3. Case study: Les Misérables

At the time of writing, it is not possible to evaluate the exposed framework with some kind
of metric which would allow testing its pertinence on various corpora. In order to see if
the methods give coherent results, we have to carefully scrutinize and compare them with
previous knowledge of the studied work. For this reason, and because of the method variations
and the multiplicity of the results (and lack of space), we chose to present only one case study: the
analysis of characters and relationships in Les Misérables, by Victor Hugo. The choice of this
work is motivated by the fact that it is a large, well-known, and immensely studied corpus,
containing various colorful characters and character relationships. Therefore, it is a strong
choice to clearly illustrate the potential of the exposed framework.

3.1. Preprocessing
The five volumes of Les Misérables, in French, were extracted from Project Gutenberg2 , and the
headers and footers of each file were manually removed. The whole text was lowercased and
lemmatized, and stopwords3 and punctuation were removed. Volume, book, and chapter
breaking points were kept for later use.
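   The paper does not name its lemmatizer; as one possible reconstruction of this step, spaCy's French model can be used together with the stopword list cited above (the file name is illustrative):

# A sketch of the preprocessing step: lowercasing, lemmatization, and removal
# of stopwords and punctuation. spaCy's French model is an assumption here,
# as the paper does not name its lemmatizer; frenchST.txt is the cited list.
import spacy

nlp = spacy.load("fr_core_news_sm")   # requires: python -m spacy download fr_core_news_sm
stopwords = set(open("frenchST.txt", encoding="utf-8").read().split())

def preprocess(chapter_text):
    doc = nlp(chapter_text.lower())
    return [tok.lemma_ for tok in doc
            if tok.is_alpha and tok.lemma_ not in stopwords]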
   We chose to use chapters as textual units. The table N (section 2.1) was built by considering
words appearing at least 20 times in the text, resulting in a table of size 365 chapters × 1974
words.
   Characters were detected using the Flair4 NER tools [30]. In order to unify characters and to
further refine the results, we used hand-made lists of character names and aliases built from the
NER results. This resulted in the detection of 54 characters. The entities considered in table E
(section 2.1) are composed of the 54 single characters and 547 character pairs, resulting in a table of size

2
  https://www.gutenberg.org/.
3
  from a list made by Jacques Savoy http://members.unine.ch/jacques.savoy/clef/frenchST.txt.
4
  https://github.com/flairNLP/flair.





365 × 601. A character (resp. a pair of characters) is considered present if it is (resp. both are)
detected at least 2 times in the chapter.
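   The detection step can be sketched as follows with Flair's French NER model; the alias dictionary and the counting logic are application-specific reconstructions, not the paper's exact code.

# A sketch of character detection with Flair's French NER; `aliases` maps a
# detected surface form to a canonical character name (hand-made, as above).
from collections import Counter
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("fr-ner")

def detect_characters(chapter_text, aliases):
    sentence = Sentence(chapter_text)
    tagger.predict(sentence)
    counts = Counter()
    for span in sentence.get_spans("ner"):
        if span.tag == "PER" and span.text in aliases:
            counts[aliases[span.text]] += 1
    return counts   # a character is "present" if counts[name] >= 2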
   Note that, in section 3.3.3, we also report experiments with entities consisting of characters
and character pairs as found in each volume (e.g. Cosette-Valjean in volume one and Cosette-
Valjean in volume two are now two different entities), with the addition of volume constants
(𝑉𝑖 = 1 in volume 𝑖 and 𝑉𝑖 = 0 in other volumes) in order to isolate volume-specific vocabulary.
This new table Evol , containing 1124 entities, permits seeing the diachronic evolution of the words
associated with volumes, characters, and character relationships.
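   One way to build Evol (a reconstruction, since the paper does not detail this step) is to split each entity column by volume and to prepend the volume indicator columns; dropping the all-zero columns, for entities never appearing in a given volume, is consistent with the 1124 entities reported above.

# A sketch of building E_vol: per-volume copies of the entity columns plus
# volume-constant columns V_i. `volume_of_unit` maps each chapter to its
# volume index (0..4); this reconstruction is illustrative.
import numpy as np

def build_E_vol(E, volume_of_unit, n_volumes=5):
    n, p = E.shape
    V = np.zeros((n, n_volumes))
    V[np.arange(n), volume_of_unit] = 1                  # volume constants V_i
    per_volume = [E * V[:, [v]] for v in range(n_volumes)]
    E_vol = np.hstack([V] + per_volume)
    return E_vol[:, E_vol.any(axis=0)]                   # drop all-zero columns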

3.2. Methods

There are two types of methods for unit embeddings, CA (section 2.2.1) and WV (section 2.2.2),
as well as two methods to derive entity embeddings from them, CENT (section 2.3.1) and REG
(section 2.3.2), making a total of 4 possible ways of obtaining entity embeddings.
   The CA method does not need any external data and results in vectors in a 364-dimensional
space, while the WV method is based on pre-trained word vectors from fastText [24] trained
on Common Crawl.5 For French, the number of word vectors is around two million and the
dimension of the vector space is 300.
   Note that, in addition to having two tables E and Evol , four methods, and a considerable
number of words and entities, results can also be presented in various ways (similarities between
entities, associations between entities and words, etc.). Thus, we chose to show here a selection
of results for each method: the 5 most associated words for a subset of entities
(section 3.3.1), the 5 most associated entities for a subset of words (section 3.3.2), and a
diachronic study of the 5 most associated words for a subset of entities (section 3.3.3). We invite
curious readers to consult the results for all words and entities, which can be found in our GitHub
repository.6

3.3. Results
3.3.1. The most associated words for a subset of entities
The first results in this section present the most associated words with a subset of entities, as
measured by the association score defined in section 2.2. Results can be found in Table 3 for all
methods.
   We can observe that the CA methods seem to summarize entities with a vocabulary closer to
the work, while the WV methods tend to use words with a wider scope, with notably
more verbs. As a result, the WV methods give a general feeling for the tone used
for describing characters and relationships, while the CA methods can depict very specific
objects, locations, or events associated with these entities. This behavior can be understood from
the nature of the unit embeddings: in the WV embedding, word vectors are fixed and do not take
into account the actual frequencies of words found in the studied corpus. A character can be
close to a word appearing only a few times (or not at all) in the corpus if this word is located near

5
    https://fasttext.cc/docs/en/crawl-vectors.html.
6
    in the "results" folder in https://github.com/gguex/char2char_vectors.





Table 3
The 5 most associated words (association score in parentheses) to a selected set of entities, regarding
CA-CENT, CA-REG, WV-CENT, and WV-REG methods. Words appearing at least two times within
the same method are in bold.
CA-CENT
                 Cosette          Cosette-Marius     Cosette-Valjean             Marius               Valjean
              poupée (0.7)           noce (1.72)        noce (1.0)          théodule (0.61)     mestienne (0.51)
               noce (0.68)        mariage (1.31)     mestienne (0.97)       jondrette (0.59)     fossoyeur (0.46)
            mestienne (0.58)        marié (1.21)      mariage (0.71)          ursule (0.56)       accusé (0.45)
             mariage (0.48)         marier (1.11)      marié (0.68)           vernon (0.53)        maire (0.39)
              marié (0.48)           baron (1.0)     corbillard (0.65)         tante (0.52)         jean (0.37)
             Marius-Valjean             Javert        Javert-Valjean             Myriel           Myriel-Valjean
                noce (1.2)         accusé (1.47)      accusé (1.85)      conventionnel (5.03)    chandelier (6.28)
             mariage (0.85)          arras (1.04)     avocat (1.12)           évêque (3.54)      gendarme (5.06)
              ursule (0.85)       mouchard (0.97)      preuve (1.1)          oratoire (3.39)       panier (4.72)
               marié (0.8)         avocat (0.96)     président (1.08)         hôpital (2.57)      couvert (4.64)
             tableau (0.74)        preuve (0.93)       forçat (1.01)       cathédrale (2.54)        deuil (4.52)
CA-REG
                 Cosette          Cosette-Marius     Cosette-Valjean             Marius               Valjean
               seau (1.23)          amant (0.83)      blessure (0.78)       jondrette (1.76)      matelas (1.02)
             poupée (0.86)         mariage (0.73)       noce (0.76)          réchaud (1.26)     chandelier (0.87)
             ravissant (0.7)       entraîner (0.7)       file (0.6)           galetas (1.11)       toulon (0.82)
              source (0.65)          noce (0.67)     corbillard (0.58)        bouge (1.05)       fossoyeur (0.79)
             rassurer (0.61)       volupté (0.63)    mestienne (0.58)        tableau (0.93)         pelle (0.76)
             Marius-Valjean             Javert        Javert-Valjean             Myriel           Myriel-Valjean
               égout (1.1)           arras (1.09)      accusé (1.04)     conventionnel (2.99)       deuil (1.14)
               vase (1.08)           roue (0.89)        nier (0.79)           évêque (1.76)     chandelier (1.07)
               issue (1.07)        bonjour (0.83)       quai (0.54)        cathédrale (1.14)      aveugle (1.01)
                sable (1.0)          malle (0.8)       avocat (0.53)          prêtre (1.11)        panier (0.94)
              couloir (0.98)      cabriolet (0.76)    fonction (0.5)      philosophie (1.06)     gendarme (0.89)
WV-CENT
                    Cosette        Cosette-Marius      Cosette-Valjean           Marius                 Valjean
                 jean (0.34)        aimer (0.38)          jean (0.6)       embrasser (0.36)          jean (0.56)
                dormir (0.28)        rêver (0.34)      jacques (0.3)        essayer (0.36)        habiller (0.27)
              regarder (0.26)      vouloir (0.32)     philippe (0.26)        avouer (0.36)          poser (0.26)
              habiller (0.26)       douter (0.32)     habiller (0.26)       vouloir (0.35)        jacques (0.26)
                 voir (0.25)       avouer (0.32)     pantalon (0.25)          voir (0.35)        pantalon (0.25)
              Marius-Valjean            Javert         Javert-Valjean            Myriel           Myriel-Valjean
                 jean (0.35)        saisir (0.34)        jean (0.54)        évêque (0.59)         évêque (0.55)
            questionner (0.31)       jean (0.34)        denis (0.31)     archevêque (0.52)      archevêque (0.46)
               essayer (0.31)       placer (0.31)      jacques (0.3)         prêtre (0.45)          prêtre (0.42)
                 oser (0.31)        retirer (0.29)       saisir (0.3)         abbé (0.39)            âme (0.42)
                poser (0.29)       dégager (0.29)     philippe (0.28)      souverain (0.38)          abbé (0.39)
WV-REG
                    Cosette        Cosette-Marius      Cosette-Valjean           Marius                 Valjean
            contempler (0.29)      éternel (0.35)         rue (0.44)        regarder (0.38)          jean (0.56)
                emplir (0.29)      amour (0.35)          jean (0.41)           voir (0.36)        pantalon (0.28)
            doucement (0.27)      humanité (0.34)     faubourg (0.41)       refermer (0.34)        jacques (0.26)
            envelopper (0.26)        âme (0.32)      boulevard (0.41)        glisser (0.34)       philippe (0.23)
              illuminer (0.26)      vérité (0.32)      quartier (0.34)        poser (0.31)         glisser (0.23)
              Marius-Valjean            Javert         Javert-Valjean            Myriel           Myriel-Valjean
                  rue (0.35)        serrer (0.34)         rue (0.35)        évêque (0.43)            ange (0.37)
            boulevard (0.35)       glisser (0.34)    boulevard (0.34)          divin (0.4)        évêque (0.31)
             souterrain (0.35)      forcer (0.34)      autorité (0.33)       humble (0.39)           âme (0.31)
                bastille (0.35)    bouger (0.33)         civil (0.33)         bonté (0.38)         amour (0.29)
              carrefour (0.34)       aller (0.32)          loi (0.33)     archevêque (0.37)         aurore (0.28)






the vocabulary associated with this character, as semantically similar words are located in the
same region of space. By contrast, CA will generally takes into account word frequencies along
with specificities in order to describe an entity, and semantically similar words can be located
far away from each other.
   Another remark can be made about the difference between CENT methods and REG meth-
ods. As expected, we see that the CENT methods reveal their additive construction between
characters and relationships: words used to describe a relationship rob off on their character
descriptions (see e.g. Cosette, Cosette-Marius, and Cosette-Valjean). By contrast, the REG
methods display more "perpendicular" descriptions of entities, with fewer words repeating.
   Note that we did not show here the least associated words with each entity, as they are fre-
quently the same for all methods and all related to the long description of the Battle of Waterloo
in volume 2 ("infanterie", "wellington", "cuirassier", "bridage"), containing no protagonist of the
story.
   Overall, we find that the CA-REG method provides the most satisfying results, with pertinent
words associated with each entity and a high variety in the choice of words.
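   The rankings in Table 3 reduce to sorting association scores; a minimal sketch, assuming entity and word vectors lie in the same space:

# Ranking the k most associated words for an entity vector y, given the word
# vectors W (equation 1 for the CA methods; cosine=True for WV, equation 2).
import numpy as np

def top_words(y, W, words, k=5, cosine=False):
    scores = W @ y
    if cosine:
        scores = scores / (np.linalg.norm(W, axis=1) * np.linalg.norm(y))
    best = np.argsort(scores)[::-1][:k]
    return [(words[j], float(scores[j])) for j in best]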

3.3.2. The most associated entities for a subset of words

Table 4
The 5 most associated entities (association score in parentheses) to a selected set of words, regarding
CA-CENT, CA-REG, WV-CENT, and WV-REG methods.
                           aimer                             rue                             justice                         guerre
CA-CENT
                  Dahlia-Fameuil (1.12)         Courfeyrac-Fauchelevent (0.79)        Azelma-Babet (1.09)       Combeferre-Fauchelevent (1.35)
                  Dahlia-Listolier (1.12)        Courfeyrac-Toussaint (0.79)         Azelma-Brujon (1.09)            Feuilly-Valjean (1.27)
                 Fameuil-Zéphine (1.12)          Eponine-Fauchelevent (0.79)       Azelma-Claquesous (1.09)          Feuilly-Marius (1.22)
                 Listolier-Zéphine (1.12)         Eponine-Gavroche (0.79)            Azelma-Magnon (1.09)            Lesgle-Valjean (1.15)
             Gillenormand-Toussaint (1.11)        Eponine-Pontmercy (0.79)        Azelma-Montparnasse (1.09)        Mabeuf-Valjean (1.15)
CA-REG
                  Cosette-Marius (0.35)               Courfeyrac (0.37)               Javert-Valjean (0.37)       Grantaire-Pontmercy (0.94)
                       Myriel (0.31)              Grantaire-Prouvaire (0.29)      Champmathieu-Valjean (0.37)          Grantaire (0.56)
              Basque-Fauchelevent (0.25)             Cosette-Javert (0.26)                Myriel (0.21)            Marius-Pontmercy (0.55)
                  Myriel-Valjean (0.22)            Marius-Prouvaire (0.22)              Grantaire (0.18)               Pontmercy (0.55)
           Fauchelevent-Gillenormand (0.21)             Enjolras (0.22)              Grantaire-Javert (0.16)            Enjolras (0.49)
WV-CENT
                 Cosette-Marius (0.38)            Grantaire-Prouvaire (0.57)           Azelma-Babet (0.34)              Grantaire (0.32)
                Fantine-Marius (0.34)             Marius-Prouvaire (0.54)            Azelma-Brujon (0.34)         Combeferre-Lesgle (0.32)
              Fantine-Pontmercy (0.34)              Cosette-Javert (0.43)          Azelma-Claquesous (0.34)          Feuilly-Lesgle (0.25)
             Basque-Fauchelevent (0.34)       Magnon-Monsieur Thénardier (0.37)      Azelma-Magnon (0.34)         Combeferre-Marius (0.25)
               Prouvaire-Valjean (0.33)               Gavroche (0.36)             Azelma-Montparnasse (0.34)     Combeferre-Grantaire (0.25)
WV-REG
                Prouvaire-Valjean (0.34)           Grantaire-Prouvaire (0.8)       Champmathieu-Valjean (0.38)    Grantaire-Pontmercy (0.39)
           Champmathieu-Chenildieu (0.31)         Marius-Prouvaire (0.77)            Azelma-Brujon (0.35)           Enjolras-Marius (0.35)
              Brevet-Chenildieu (0.31)               Courfeyrac (0.66)             Azelma-Claquesous (0.35)            Grantaire (0.31)
              Brevet-Cochepaille (0.31)             Cosette-Javert (0.64)            Azelma-Magnon (0.35)         Combeferre-Lesgle (0.31)
           Champmathieu-Cochepaille (0.31)            Prouvaire (0.63)            Azelma-Montparnasse (0.35)       Cosette-Gavroche (0.27)



   These results are extracted from a transposed table and display the most associated entities
to a selected set of words. They can be found in Table 4. This type of result can be seen as a
query, made from a single word by a practitioner, which outputs the most associated entities in
the work related to that query. We chose here to show the top entities related to the words "aimer",
"rue", "justice", and "guerre", as they represent some of the main topics of the book. In this task
again, from our point of view, the CA-REG method displays the most accurate results: the main love
relationship (Cosette-Marius) of the book is the most associated entity for "aimer", several "amis
de l'ABC" (a revolutionary group) are most associated with "rue", the cop-suspect relationship







Figure 1: Resulting weighted and signed networks between main characters, with examples of word
queries ("aimer", "rue", "justice", and "guerre"). These networks are computed with the CA-REG method
(𝜆 = 0.01). Red indicates positive affinity, blue negative affinity, and edge width is proportional to the
number of detected interactions between characters.


(Javert-Valjean) is the top entity for "justice", and military officers or bellicose characters are
associated with "guerre". While somewhat inferior on the selected set of queries, the WV methods
have the advantage of being able to query words outside the scope of the book, as the pre-trained
word embedding possesses a very large vocabulary.
   Note that another way to display these results is through weighted signed networks, as in
Figure 1 (for CA-REG). The network structure represents the number of times characters
are detected together (which does not depend on the query), and the signed weights (edge colors)
display the association score between character relationships (edges) and the queried word. This
representation provides quick visual support for exploring the studied work and could be
implemented as a standalone program.
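   Such a network view can be produced with standard graph tooling; the sketch below uses networkx and matplotlib, with illustrative input structures (co-occurrence counts and per-pair association scores for the queried word).

# A sketch of the Figure 1 style display: edge width from co-occurrence
# counts, edge color from the sign of the pair-word association score.
import networkx as nx
import matplotlib.pyplot as plt

def draw_query_network(cooccurrence, pair_scores, query):
    """cooccurrence / pair_scores: dicts keyed by (char_a, char_b) tuples."""
    G = nx.Graph()
    for (a, b), n_units in cooccurrence.items():
        G.add_edge(a, b, width=n_units, score=pair_scores[(a, b)])
    pos = nx.spring_layout(G, seed=0)
    colors = ["red" if G[a][b]["score"] > 0 else "blue" for a, b in G.edges]
    widths = [G[a][b]["width"] for a, b in G.edges]
    nx.draw(G, pos, with_labels=True, edge_color=colors, width=widths)
    plt.title(f'Query: "{query}"')
    plt.show()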

3.3.3. A diachronic study of the most associated words for a subset of entities
These results are obtained from the table Evol , where entities are considered distinct in each
volume. This permits tracking the evolution of association scores along the book.
In addition to entities, we also define a constant term 𝑉𝑖 for each volume 𝑖, which absorbs
the words associated with each volume. Results for the constants and a subset of entities (Valjean,
Cosette, Cosette-Valjean) can be found in Table 5. Note that we did not show the CENT results in
this table, as they are similar to those found in Table 3: words are often repeated for different






Table 5
The 5 most associated words (association score in parentheses) for the volume constants, Valjean, Cosette,
and Cosette-Valjean, as found in each volume, for the CA-REG and WV-REG methods (𝜆 = 0.01).
CA-REG
                      𝑉1                  𝑉2                  𝑉3                     𝑉4                   𝑉5
               huissier (1.4)    cuirassier (3.38)      gamin (1.92)         émeute (1.48)          sable (3.04)
                 hôte (1.11)      infanterie (2.9)       mine (1.38)          révolte (0.86)       berge (2.32)
                arras (0.92)    sacrement (2.69)         farce (0.81)    bourgeoisie (0.84)        égout (2.16)
               lampe (0.91)        brigade (2.41)     ignorance (0.74)     populaire (0.82)         voûte (1.9)
            montreuil (0.81)        division (2.4)     jondrette (0.7)   insurrection (0.81)        vase (1.89)
                 Valjean 1             Valjean 2           Valjean 3             Valjean 4            Valjean 5
            chandelier (1.03)         pelle (2.6)       ursule (1.49)      réverbère (0.98)       matelas (2.54)
                toulon (0.8)     fossoyeur (2.35)    luxembourg (1.06)      hausser (0.45)         ronde (1.96)
               gervai (0.73)        pioche (1.51)      tableau (0.93)     promenade (0.37)         galerie (1.06)
               bagne (0.65)          carte (1.39)         banc (0.81)       lanterne (0.36)      lanterne (0.99)
               maire (0.57)     mestienne (1.35)      mouchoir (0.77)          tuyau (0.35)          rive (0.99)
                 Cosette 1             Cosette 2           Cosette 3             Cosette 4            Cosette 5
              gargote (0.59)          seau (1.48)              -           ravissant (1.11)         encre (0.76)
              balayer (0.57)       poupée (0.99)               -              céleste (0.76)       plume (0.59)
              alouette (0.56)       source (0.84)              -             volupté (0.67)         noce (0.49)
             servante (0.43)       gargote (0.71)              -              frémir (0.64)     chandelier (0.48)
                mois (0.34)     mestienne (0.64)               -               lancier (0.6)   antichambre (0.47)
           Cosette-Valjean 1    Cosette-Valjean 2    Cosette-Valjean 3   Cosette-Valjean 4     Cosette-Valjean 5
              maladie (0.49)        façade (0.68)              -           promenade (0.5)          noce (1.27)
             médecin (0.48)      corbillard (0.66)             -              chaîne (0.47)        marié (0.93)
              demain (0.33)      mestienne (0.6)               -            blessure (0.46)        mardi (0.89)
              surprise (0.28)     bâtiment (0.56)              -               tuyau (0.45)      mariage (0.86)
               auprès (0.28)           cul (0.55)              -         luxembourg (0.44)            file (0.65)
WV-REG
                      𝑉1                  𝑉2                  𝑉3                     𝑉4                   𝑉5
            demander (0.36)          saint (0.39)       gamin (0.45)        violence (0.44)        égout (0.52)
               décider (0.3)         mont (0.39)        garçon (0.42)          haine (0.42)         quai (0.45)
                 aider (0.3)      régiment (0.38)        jeune (0.36)         révolte (0.42)          rue (0.44)
             expliquer (0.29)     chapelle (0.38)       enfant (0.35)      souffrance (0.4)           eau (0.42)
             plaindre (0.29)     infanterie (0.36)        père (0.34)       étincelle (0.39)     chaussée (0.42)
                 Valjean 1             Valjean 2           Valjean 3             Valjean 4            Valjean 5
              essayer (0.29)          jean (0.54)      admirer (0.32)           jean (0.71)          jean (0.72)
             réfléchir (0.27)      jacques (0.33)       passer (0.31)        jacques (0.41)      pantalon (0.42)
             expliquer (0.24)     pantalon (0.29)       observer (0.3)     pantalon (0.36)        jacques (0.39)
                 agir (0.24)           mr (0.28)         guetter (0.3)          louis (0.34)     philippe (0.34)
           questionner (0.24)        denis (0.28)        croiser (0.3)      philippe (0.33)         denis (0.33)
                 Cosette 1             Cosette 2           Cosette 3             Cosette 4            Cosette 5
                  an (0.41)         dormir (0.33)              -               rêver (0.32)         rêver (0.31)
                mois (0.39)       regarder (0.29)              -            regarder (0.31)        mentir (0.29)
                mère (0.36)          sentir (0.28)             -          contempler (0.28)        écrire (0.29)
                 fille (0.35)    endormir (0.28)               -             pleurer (0.27)     demander (0.28)
               enfant (0.33)       respirer (0.28)             -                 lire (0.27)      pleurer (0.28)
           Cosette-Valjean 1    Cosette-Valjean 2    Cosette-Valjean 3   Cosette-Valjean 4     Cosette-Valjean 5
                 voir (0.32)           rue (0.55)              -                jean (0.52)      mariage (0.42)
             entendre (0.31)         ruelle (0.48)             -           pantalon (0.35)          marié (0.4)
             frissonner (0.3)   boulevard (0.45)               -                gilet (0.28)          noce (0.4)
            grommeler (0.3)            mur (0.4)               -                 gris (0.27)          gai (0.33)
               essayer (0.3)       faubourg (0.4)              -            manteau (0.26)          amour (0.3)






entities and are less convincing.
   Here again, we see that the associated words for the WV method give the general tone of volumes
and entities, while the CA results are more specific and related to particular events which occurred
to the characters. As expected, the words associated with the volume constants give a short overview
of each volume, especially with the CA-REG method (e.g. 𝑉2 for the Battle of Waterloo, 𝑉4 for the
barricade event). The words associated with entities also seem accurate in describing them. Note
that Cosette was not detected in volume 3 because she is not explicitly cited (she is often referred
to as "the daughter of M. Leblanc"), and this also explains the absence of the Cosette-Valjean pair.


4. Conclusion
In this article, we introduced a general framework for automatically extracting textual
information about narrative entities from a small corpus or a single work. The framework is built
on two tables, the unit-word table N and the unit-entity table E. This data organization sets
subsequent analyses in a classical statistical framework, where the goal is to see how the variables
in E (the entities) affect the variables in N (the vocabulary) for each textual unit. A choice
was made to use embeddings for analyzing these effects: units and words are embedded using
Correspondence Analysis or pre-trained Word Embeddings on N, and entities are embedded in
the same space as units using the Centroids or the Regressions method on E. These embeddings
are then used to see affinities between entities and words, enabling the characterization
of the former by the latter. A case study on Les Misérables was performed to see if the methods
gave promising results.
   The first important choice in the analysis is how to define the size of units. Other corpora
were also tested (e.g. Shakespeare plays) and it seems important to define units of at least
a paragraph size (after preprocessing) in order to represent them accurately. Choosing small
units might successfully capture word specificities related to a small subset of entities,
but unit vectors become almost orthogonal to one another if the size of units is too small. This
situation results in an overfitting regime with high variance and low bias, i.e. unit positions
can be distant due to differences in only a few rare words. By contrast, large units will result in
an underfitting regime, with low variance and high bias, failing to capture entity specificities,
but more robust to particularities in word usage. Having enough units is also important in
order to properly locate entities in the embedding space. To analyze characters and
relationships, we advise practitioners to use their prior knowledge of the work in order to
split the studied narrative as closely as possible into "scenes" (as found in theater), each describing
a particular event between an almost constant set of characters.
   The second choice is which entities to study. This choice is of course driven by the
research question, but is also limited by the available automatic extraction tools. These entities can
be varied, but must appear frequently in the work in order to be placed correctly in the embedding
space. However, it is inadvisable to define an entity which is almost always present (e.g. a narrator),
as it will already be represented by the origin in the CENT method or by the constant term in
the REG method. As a rule of thumb, the number of entities should ideally be lower than
the number of textual units. However, even with an excess number of entities (as in our
case study, where we had 601 and 1124 entities for 365 units), if some entities appear rarely,
analyses are still possible. Note that a version of the table E containing entity counts rather than presences in each unit was also tested in experiments, but it gave similar results for the studied corpus.
   The choice of using embeddings in which units, entities, and words are all located is motivated by the fact that the resulting space permits many types of exploration. As presented in this article, we can extract the most (or least) associated words for each entity, or rank entities according to a word query, but other types of measurements could also be made. Entities could be projected onto a particular axis of the space, defined by two sets of contrasting words, in order to highlight a particular opposition (positive-negative for sentiment analysis, introvert-extrovert, friend-enemy, etc.), as sketched below. This approach could also be combined with a clustering of the words, or a Topic Modeling method, thus further refining the different regions of the embedding space. The relative locations of entities could also be used to cluster or classify them. All these avenues can be explored in future research.
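
To make the axis idea concrete, here is a minimal sketch in Python, assuming unit-space vectors are already available as numpy arrays keyed by name; all identifiers are hypothetical:

```python
import numpy as np

def opposition_axis(word_vecs, positive_words, negative_words):
    """Axis pointing from the centroid of the negative words to the centroid of the positive words."""
    pos = np.mean([word_vecs[w] for w in positive_words], axis=0)
    neg = np.mean([word_vecs[w] for w in negative_words], axis=0)
    axis = pos - neg
    return axis / np.linalg.norm(axis)

def score_entities(entity_vecs, axis):
    """Project each entity vector onto the axis; higher score = closer to the positive pole."""
    return {name: float(vec @ axis) for name, vec in entity_vecs.items()}

# Hypothetical usage with a friend-enemy opposition:
# axis = opposition_axis(word_vecs, ["ami", "allié"], ["ennemi", "rival"])
# ranked = sorted(score_entities(entity_vecs, axis).items(), key=lambda kv: kv[1], reverse=True)
```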
   The difference between the CA and WV embeddings appears quite clearly in the results. CA highlights particular words associated with entities, very specific to the studied work and to the narrative events found in it, while WV gives a general feeling of the tone of the text when these entities are present. This difference is explained by the fact that CA only considers the words appearing within the work, possibly assigning very different locations to semantically similar words, while WV word vectors are positioned according to their semantic and syntactic similarities. An entity located in the WV space will therefore lie in a semantic or syntactic region, and its characterizing words should all be related. Results show that CA methods generally perform better for quickly interpreting entities within the narrative, but they might be limited for some applications. Indeed, the advantage of WV embeddings is that their space is absolute, permitting the comparison of results between sequentially studied texts, and that they contain a larger vocabulary. This last property could be exploited with a fixed list of relationship attributes (e.g., friend, enemy, family, colleague), which do not necessarily appear in every text of the studied corpus, in order to categorize character relationships.
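
For instance, this categorization step could be sketched as follows, assuming each relationship and each attribute word already has a vector in the WV space (names are hypothetical):

```python
import numpy as np

def categorize_relationship(rel_vec, attribute_vecs):
    """Assign a relationship to the attribute word with the highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(attribute_vecs, key=lambda attr: cos(rel_vec, attribute_vecs[attr]))

# Hypothetical usage: attribute_vecs maps "friend", "enemy", "family", "colleague"
# to their pre-trained word vectors; rel_vec is the embedding of a character pair.
# category = categorize_relationship(rel_vec, attribute_vecs)
```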
   The choice between the CENT method and the REG method is relatively easy: thanks to its hyperparameter 𝜆, the REG method gives results similar to the CENT method when 𝜆 is high (up to a contraction of the entity vectors), but it also gives more "perpendicular" sets of words describing entities when 𝜆 is low. It is thus clearly the superior choice for producing a variety of results. The value of this hyperparameter 𝜆 depends on what the practitioner desires. If some entities are, by construction, completely included in others, such as a character and a character pair, and the practitioner wants specificities about the finer-grained entity, 𝜆 must be set to a low value. By contrast, if they do not mind having some entities described as a mixture of others, they can set 𝜆 to a high value. However, very low values of 𝜆 should be avoided when the number of entities is high compared to the number of units, as this leads to an overfitting of the regression coefficients and to the association of very rare and specific words with entities.
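
As a minimal sketch, the REG step could be set up with off-the-shelf ridge regression, assuming X holds the unit coordinates and E the unit-entity presences; the exact estimator and the precise role of 𝜆 in the paper may differ in details:

```python
import numpy as np
from sklearn.linear_model import Ridge

def embed_entities_reg(X, E, lam):
    """Place entities in the unit space by ridge-regressing unit coordinates on entity presences.

    X: (n_units, dim) unit coordinates; E: (n_units, n_entities) 0/1 presence table.
    Returns one coefficient vector per entity, i.e. an (n_entities, dim) matrix.
    """
    model = Ridge(alpha=lam, fit_intercept=True)
    model.fit(E, X)          # one multi-output regression with a shared penalty lam
    return model.coef_.T     # low lam: entity-specific directions; high lam: shrunken vectors

# Hypothetical usage: entity_vecs = embed_entities_reg(X, E, lam=1.0)
```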
   Finally, the biggest weakness of this framework so far is the difficulty of validating its pertinence. Several other case studies, with results carefully scrutinized against prior knowledge, should be undertaken to see whether the results are trustworthy, but this type of experiment is expensive in both time and human resources. Another idea would be to use annotated corpora such as the one described in [31], where human annotators classified character relationships into various
categories. For example, we could test whether the presented method can retrieve these categories by assigning each relationship to the category word with the highest association score. Such experiments are promising, but they require an efficient automatic entity tagger, able to detect and, especially, unify characters in a large quantity of documents, and this tool unfortunately does not exist yet. Nevertheless, these first case studies gave promising results for the framework, and its flexibility could lead to various applications.


A. Appendix
A.1. Correspondence Analysis
Starting from the $(n \times v)$ contingency table $\mathbf{N} = (n_{ij})$, we define the vector of unit weights as $\mathbf{f} = (f_i) := (n_{i\bullet}/n_{\bullet\bullet})$ and the vector of word weights as $\mathbf{g} = (g_j) := (n_{\bullet j}/n_{\bullet\bullet})$, where $\bullet$ denotes summation over the replaced index. It is then possible to compute the weighted scalar product matrix between units $\mathbf{K} = (k_{ij})$ with

$$k_{ij} := \sqrt{f_i f_j} \sum_{k=1}^{v} g_k\,(q_{ik} - 1)(q_{jk} - 1), \qquad (7)$$

where $q_{ik} = \frac{n_{ik}\, n_{\bullet\bullet}}{n_{i\bullet}\, n_{\bullet k}}$ is the quotient of independence of the cell $(i, k)$. The vector of textual unit $i$, $\mathbf{x}_i = (x_{i\alpha})$, is obtained from the eigendecomposition of the matrix $\mathbf{K} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$ with

$$x_{i\alpha} := \frac{\sqrt{\lambda_\alpha}}{\sqrt{f_i}}\, u_{i\alpha}, \qquad (8)$$

where the $\lambda_\alpha$ are the eigenvalues contained in the diagonal matrix $\boldsymbol{\Lambda}$ and the $u_{i\alpha}$ are the eigenvector components found in $\mathbf{U}$. We find the vector of word $j$, $\mathbf{w}_j = (w_{j\alpha})$, with

$$w_{j\alpha} := \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} f_i\, q_{ij}\, x_{i\alpha}. \qquad (9)$$

Note that various other quantities of interest can also be computed in CA, such as

$$p_\alpha := \frac{\lambda_\alpha}{\lambda_\bullet} \;:\; \text{the proportion of inertia expressed by axis } \alpha,$$

$$c^{u}_{i\alpha} := \frac{f_i\, x_{i\alpha}^2}{\lambda_\alpha} \;:\; \text{the contribution of unit } i \text{ to axis } \alpha,$$

$$c^{w}_{j\alpha} := \frac{g_j\, w_{j\alpha}^2}{\lambda_\alpha} \;:\; \text{the contribution of word } j \text{ to axis } \alpha,$$

$$h^{u}_{i\alpha} := \frac{x_{i\alpha}^2}{\sum_\alpha x_{i\alpha}^2} \;:\; \text{the contribution of axis } \alpha \text{ to unit } i,$$

$$h^{w}_{j\alpha} := \frac{w_{j\alpha}^2}{\sum_\alpha w_{j\alpha}^2} \;:\; \text{the contribution of axis } \alpha \text{ to word } j.$$

For a detailed interpretation of these different quantities, see [21].
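
As a minimal illustration, the coordinates of equations (7)-(9) can be sketched in Python as follows, assuming a dense numpy count matrix; zero rows or columns and near-zero eigenvalues would need extra care, and all names are illustrative:

```python
import numpy as np

def ca_embedding(N):
    """Correspondence Analysis coordinates for the units (rows) and words (columns) of N."""
    N = np.asarray(N, dtype=float)
    total = N.sum()
    f = N.sum(axis=1) / total                        # unit weights f_i
    g = N.sum(axis=0) / total                        # word weights g_j
    Q = (N / total) / np.outer(f, g)                 # quotients of independence q_ij
    # Weighted scalar product matrix K (equation 7)
    K = np.outer(np.sqrt(f), np.sqrt(f)) * (((Q - 1) * g) @ (Q - 1).T)
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-12                               # keep strictly positive eigenvalues
    lam, U = lam[keep][::-1], U[:, keep][:, ::-1]    # descending order
    X = np.sqrt(lam) * U / np.sqrt(f)[:, None]       # unit coordinates (equation 8)
    W = (f[:, None] * Q).T @ X / np.sqrt(lam)        # word coordinates (equation 9)
    return X, W, lam
```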





A.2. Unit embedding based on pre-trained word vectors
This method is justified and detailed in [27]. Let $\mathbf{w}_1, \ldots, \mathbf{w}_v$ be the pre-trained vectors of the words appearing in the studied corpus, and let the $(n \times v)$ table $\mathbf{N}$ count the frequencies of these words in the $n$ textual units. We first construct the uncentered vector $\tilde{\mathbf{x}}_i$ of each unit $i$ with

$$\tilde{\mathbf{x}}_i = \sum_{j=1}^{v} \frac{n_{ij}}{n_{i\bullet}} \cdot \frac{a}{a + n_{\bullet j}/n_{\bullet\bullet}}\, \mathbf{w}_j, \qquad (10)$$

where $a > 0$ is a hyperparameter which gives less importance to frequent words as $a \to 0$. In this article, we set $a$ to the recommended value of $0.01$. Let $\tilde{\mathbf{X}}$ be the matrix whose columns are the vectors $\tilde{\mathbf{x}}_i$, and let $\mathbf{u}$ be its first left singular vector. We compute the vector $\mathbf{x}_i$ of each unit $i$ with

$$\mathbf{x}_i = \tilde{\mathbf{x}}_i - \mathbf{u}\mathbf{u}^\top \tilde{\mathbf{x}}_i. \qquad (11)$$

This last equation centers the unit vectors by removing their component along the direction of the first singular vector $\mathbf{u}$.
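
A numpy sketch of equations (10)-(11) could look as follows, assuming N is a dense count matrix and W stacks the pre-trained word vectors row-wise; the names are illustrative:

```python
import numpy as np

def sif_unit_embedding(N, W, a=0.01):
    """Unit vectors per equation (10), with first-singular-direction removal (equation 11).

    N: (n_units, v) word count table; W: (v, dim) pre-trained word vectors.
    """
    N = np.asarray(N, dtype=float)
    p = N.sum(axis=0) / N.sum()                      # corpus frequencies n_.j / n_..
    weights = a / (a + p)                            # down-weight frequent words
    X_tilde = (N / N.sum(axis=1, keepdims=True) * weights) @ W
    # First left singular vector of the matrix whose columns are the unit vectors
    u = np.linalg.svd(X_tilde.T, full_matrices=False)[0][:, 0]
    return X_tilde - np.outer(X_tilde @ u, u)        # x_i = x~_i - u u^T x~_i
```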


References
 [1] F. Moretti, “Operationalizing”: or, the Function of Measurement in Modern Literary Theory,
     The Journal of English Language and Literature 60 (2014) 3–19.
 [2] T. Underwood, A Genealogy of Distant Reading, DHQ: Digital Humanities Quarterly 11
     (2017).
 [3] M. P. Eve, Close Reading with Computers: Genre Signals, Parts of Speech, and David
     Mitchell’s Cloud Atlas, SubStance 46 (2017) 76–104.
 [4] W. Schmid, Narratology: an introduction, Walter de Gruyter, 2010.
 [5] A. Agarwal, A. Kotalwar, J. Zheng, O. Rambow, SINNET: Social Interaction Network
     Extractor from Text, in: The Companion Volume of the Proceedings of IJCNLP 2013:
     System Demonstrations, Asian Federation of Natural Language Processing, Nagoya, Japan,
     2013, pp. 33–36. URL: https://aclanthology.org/I13-2009.
 [6] S. Chaturvedi, M. Iyyer, H. Daume III, Unsupervised learning of evolving relationships
     between literary characters, in: Proceedings of the AAAI Conference on Artificial Intelli-
     gence, volume 31, 2017.
 [7] J. Li, A. Sun, J. Han, C. Li, A Survey on Deep Learning for Named Entity Recognition,
     IEEE Transactions on Knowledge and Data Engineering 34 (2022) 50–70. doi:10.1109/
     tkde.2020.2981314.
 [8] V. Labatut, X. Bost, Extraction and Analysis of Fictional Character Networks: A Sur-
     vey, ACM Computing Surveys 52 (2019) 89:1–89:40. URL: https://doi.org/10.1145/3344548.
     doi:10.1145/3344548.
 [9] S. Min, J. Park, Modeling narrative structure and dynamics with networks, sentiment
     analysis, and topic modeling, PLOS ONE 14 (2019) e0226025. doi:10.1371/journal.
     pone.0226025.
[10] I. Novakova, D. Siepmann, Literary Style, Corpus Stylistic, and Lexico-Grammatical
     Narrative Patterns: Toward the Concept of Literary Motifs, in: Phraseology and Style in
     Subgenres of the Novel, Springer International Publishing, 2019, pp. 1–15. doi:10.1007/
     978-3-030-23744-8_1.
[11] S. Grayson, M. Mulvany, K. Wade, G. Meaney, D. Greene, Novel2vec: Characterising 19th
     century fiction via word embeddings, in: 24th Irish Conference on Artificial Intelligence
     and Cognitive Science, 2016.
[12] R. J. Heuser, Word vectors in the eighteenth century, in: ADHO 2017-Montréal, 2017.
[13] S. J. Kerr, When Computer Science Met Austen and Edgeworth, NPPSH Reflections 1
     (2017) 38–52.
[14] M. Elsner, Character-based kernels for novelistic plot structure, in: Proceedings of the 13th
     Conference of the European Chapter of the Association for Computational Linguistics,
     Association for Computational Linguistics, Avignon, France, 2012, pp. 634–644. URL:
     https://aclanthology.org/E12-1065.
[15] J. Lee, C. Y. Yeung, Extracting Networks of People and Places from Literary Texts, in: Pro-
     ceedings of the 26th Pacific Asia Conference on Language, Information, and Computation,
     Faculty of Computer Science, Universitas Indonesia, Bali, Indonesia, 2012, pp. 209–218.
     URL: https://aclanthology.org/Y12-1022.
[16] Y. Rochat, F. Kaplan, Analyse des réseaux de personnages dans Les Confessions de
     Jean-Jacques Rousseau, Les Cahiers du numérique 10 (2014) 109–133. URL: https://www.
     cairn.info/revue-les-cahiers-du-numerique-2014-3-page-109.htm. doi:10.3166/LCN.10.
     3.109-133.
[17] A. Grener, M. Luczak-Roesch, E. Fenton, T. Goldfinch, Towards A Computational Literary
     Science: A Computational Approach To Dickens’ Dynamic Character Networks (2017).
     doi:10.5281/ZENODO.259499.
[18] G. A. Sack, Character networks for narrative generation: Structural balance theory and
     the emergence of proto-narratives, Complexity and the human experience: Modeling
     complexity in the humanities and social sciences (2014) 81–104.
[19] V. Krishnan, J. Eisenstein, "You’re Mr. Lebowski, I’m the Dude": Inducing Address
     Term Formality in Signed Social Networks, in: Proceedings of the 2015 Confer-
     ence of the North American Chapter of the Association for Computational Linguis-
     tics: Human Language Technologies, Association for Computational Linguistics, 2015.
     doi:10.3115/v1/n15-1185.
[20] F. Incitti, F. Urli, L. Snidaro, Beyond word embeddings: A survey, Information Fusion 89
     (2023) 418–436. doi:10.1016/j.inffus.2022.08.024.
[21] L. Lebart, B. Pincemin, C. Poudat, Analyse des données textuelles, number 11 in Mesure et
     évaluation, Presses de l’Université du Québec, Québec, 2019.
[22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations
     in Vector Space, arXiv:1301.3781 [cs] (2013). URL: http://arxiv.org/abs/1301.3781.
[23] J. Pennington, R. Socher, C. Manning, Glove: Global Vectors for Word Representation, in:
     Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
     (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543.
     URL: http://aclweb.org/anthology/D14-1162. doi:10.3115/v1/D14-1162.
[24] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword
     Information, Transactions of the Association for Computational Linguistics 5 (2017)
     135–146. URL: https://direct.mit.edu/tacl/article/43387. doi:10.1162/tacl_a_00051.
[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[26] Y. Li, T. Yang, Word Embedding for Understanding Natural Language: A Survey, in:
     Studies in Big Data, Springer International Publishing, 2017, pp. 83–104. doi:10.1007/
     978-3-319-53817-4_4.
[27] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings,
     in: International conference on learning representations, 2017.
[28] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International
     conference on machine learning, PMLR, 2014, pp. 1188–1196.
[29] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances,
     in: International conference on machine learning, PMLR, 2015, pp. 957–966.
[30] A. Akbik, D. Blythe, R. Vollgraf, Contextual String Embeddings for Sequence Labeling,
     in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp.
     1638–1649.
[31] P. Massey, P. Xia, D. Bamman, N. A. Smith, Annotating Character Relationships in Literary
     Texts (2015). arXiv:1512.00728.



