=Paper=
{{Paper
|id=Vol-3602/paper4
|storemode=property
|title=A Framework for Embedding Entities in a Textual Narrative: a Case Study on Les Misérables
|pdfUrl=https://ceur-ws.org/Vol-3602/paper4.pdf
|volume=Vol-3602
|authors=Guillaume Guex
|dblpUrl=https://dblp.org/rec/conf/comhum/Guex22
}}
==A Framework for Embedding Entities in a Textual Narrative: a Case Study on Les Misérables==
A Framework for Embedding Entities in a Textual Narrative: a Case Study on Les Misérables

Guillaume Guex
Faculty of Arts, Department of Language and Information Sciences, University of Lausanne, bâtiment Anthropole, 1015 Lausanne, Switzerland
guillaume.guex@unil.ch (G. Guex), ORCID: 0000-0003-1001-9525 (G. Guex)

Abstract
In this article, we propose a general and flexible framework for studying narrative entities found in a literary work. This framework is presented starting from a broad perspective, namely how to segment the work into textual units and organize the resulting data, and is narrowed down to a particular case: the study of characters and relationships found in Les Misérables. A notable choice was made in the current instance of the framework: the construction of embeddings containing both textual units and narrative entities alongside words. These embeddings, where different spatial regions can be interpreted with word vectors, are the keys helping us to characterize the studied entities. Four types of embedding methods are constructed, and their results on Les Misérables demonstrate the potential of this framework for analyzing characters and relationships in a narrative.

Keywords: Digital Humanities, Distant Reading, Textual Narrative, Narrative Entity, Embeddings, Characters

COMHUM 2022: Workshop on Computational Methods in the Humanities, June 9–10, 2022, Lausanne, Switzerland. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

In the field of Digital Humanities, Distant Reading tools [1] allow researchers to quickly gain knowledge of textual corpora without actually reading them. The purposes of these methods are various, but they fall mainly into two groups. In the first case, the methods are used to tag, classify, or summarize large quantities of documents, in order to quickly structure information or to produce a discourse about the whole studied corpus [2]. Methods in this case rely heavily on Big Data and make extensive use of Machine Learning, often with the help of supervised methods. In the second case, researchers use computational methods to underline hidden structures in a small corpus or even a single document, which helps them refine their understanding of this corpus or validate hypotheses [3]. Methods in this setting can also rely on Machine Learning, but must typically be built with more caution and attention to detail: corpora are smaller, analyses are closer to the work, and methods must be transparent in order to interpret results appropriately. The use of exploratory tools and unsupervised methods is also preferred in this context, as it is less desirable to base methods on information coming from large external corpora. The method proposed in this article belongs to the second group, as it is unsupervised and can be applied to a single document.

When a single (or a few) literary work is analyzed, a common practice is to study the narrative entities (characters, events, locations, etc.) used by the author in her/his book [4]. Researchers are frequently interested in depicting them and in seeing how they interact with each other in the story.
Various computational tools can help them in this task, to name a few: Named Entity Recognition tools [5, 6, 7], Automatic Character Network Extraction [8], Sentiment Analysis and Topic Modeling [9], Textometry [10], and Word Embeddings [11, 12, 13]. All these methods have been used to explicitly show hidden structures constructed by the author in her/his work. They make it possible to find patterns, and can help to categorize particular narrative constructions, writing styles, or genres. These kinds of methods can be a great complement to classical analyses of literary works, as they efficiently summarize information which is otherwise quite diffuse.

In this article, we propose a general framework for automatically characterizing various narrative entities in a literary work. The entire framework is presented starting from a wide perspective, namely how to organize the textual data, and is narrowed down to a specific use, the study of character relationships in Les Misérables, by Victor Hugo. Throughout this presentation, various choices are made to highlight a particular use of this framework, but these choices should be viewed as suggestions rather than rules: the real strength of this framework is its flexibility, and the direction taken in this article is oriented toward a defined task. To be more specific, we will show how to use embeddings in order to locate characters and their relationships alongside the vocabulary. An association measure can then be constructed between these words and entities, which can help a practitioner to depict them. Four variations of this method are proposed and tested on Les Misérables.

The idea behind this framework comes from the field of automatic extraction and analysis of character networks from literary works (see [8] for a survey). When building character networks from a textual narrative, one of the most widespread methods consists in dividing the studied work into $n$ narrative units or contexts $u_1, \dots, u_n$, which can be, e.g., sentences, paragraphs, or chapters, and then counting the number of units where characters co-occur [9, 14, 15, 16, 17]. Usually, the text constituting these units is discarded, and the resulting network displays edges which roughly represent an aggregated number of interactions between characters. However, by doing so, the aggregation mixes various kinds of interactions and gives little information about the type of relationship which exists between characters. Various improvements were proposed in order to weight [18] or sign [19] (or both [9]) the edges in the character networks. A particular inspiration for the current work is the article by Min and Park (2019) [9], whose authors also analyzed characters in Les Misérables by building various signed and weighted networks, with the help of Sentiment Analysis and Topic Modeling. The current framework was built to generalize this idea of refining character relationships, by formalizing the data structure while keeping the directions of exploration as open as possible. Embeddings [20] appeared to us to be the proper tool for achieving this. As a matter of fact, with embeddings, the textual contents of units are transformed into workable mathematical objects (the vectors), usable for various tasks, while preserving as much information as possible. The framework has been further generalized in order to be applicable to different sorts of narrative entities, but the presented case remains the study of character relationships in Les Misérables.
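To make this baseline concrete, here is a minimal sketch (not taken from the paper's repository; the function and variable names are ours) of the classical co-occurrence counting described above, assuming characters have already been detected in each unit:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_network(unit_characters):
    """For each character pair, count the number of textual units
    where both characters appear (classical network extraction)."""
    edge_counts = Counter()
    for chars in unit_characters:
        for pair in combinations(sorted(set(chars)), 2):
            edge_counts[pair] += 1
    return edge_counts

# Toy example: three units with their detected characters.
units = [{"Cosette", "Valjean"}, {"Cosette", "Marius"}, {"Cosette", "Valjean"}]
print(cooccurrence_network(units))
# Counter({('Cosette', 'Valjean'): 2, ('Cosette', 'Marius'): 1})
```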
The current article is structured as follows. Section 2 defines the framework, with section 2.1 defining the data organization, section 2.2 describing how to embed textual units, and section 2.3 deriving entity vectors lying in the same space as units. In section 3, we present the specific methodology and results for the case study of character relationships in Les Misérables, and section 4 draws conclusions and perspectives about this work. All (Python) scripts and datasets used in this article, as well as extended results, can be found in the dedicated GitHub repository (https://github.com/gguex/char2char_vectors).

2. Framework

2.1. Data organization

In this article, a textual narrative is divided into $n$ textual units $u_1, \dots, u_n$ and is represented through two tables. The first one is well known in the field of textual analysis and consists in the $(n \times v)$ unit-word contingency table N, as represented by Table 1, where $v$ is the vocabulary size. In this table, each row represents a unit, each column a word, and cell $n_{ij}$ counts the number of times word $j$ appears in unit $i$. Using this table denotes a Bag-of-Words approach in our analyses.

Table 1: A snippet of the unit-word contingency table N extracted from Les Misérables. Rows are chapters, columns are words in the vocabulary, and cell $n_{ij}$ counts the number of times word $j$ appears in chapter $i$.

         aller  allumer  apercevoir  bas  bon  ...
 u_101    23      2          6       11    6   ...
 u_102    12      1          0        3    9   ...
 u_103    10      0          5        1    5   ...
 u_104     0      0          1        0    0   ...

The second table is the unit-entity table, noted E. It has a size of $(n \times p)$, where $p$ is the number of narrative entities found in the text, and cell $e_{ij}$ indicates the presence, or the count for a weighted version, of entity $j$ in unit $i$. A narrative entity, in the context of this article, is loosely defined in order to remain flexible for various types of texts or analyses. It can roughly be seen as a recurring object with some importance in the narration. For example, it can be a location, an object, a character, a pair of characters (or even a triplet, a quadruplet, etc.), an oriented character interaction (e.g., a dialog), or even a particular recurring event involving multiple characters (e.g., a meeting). In this article, we mostly consider characters and pairs of characters as entities, as shown in Table 2. Note that in the present case, we consider that a character or a pair of characters is present in a unit if character names (or aliases) are detected above a fixed threshold. A weighted version of this table, where $e_{ij}$ contains the number of occurrences of entity $j$ in unit $i$, is also possible. However, the equations presented in this article are written for the presence/absence version.

Table 2: A snippet of the unit-entity table E extracted from Les Misérables. Rows are chapters, columns are characters (left) and character pairs (right), and cell $e_{ij}$ denotes whether $j$ appears in chapter $i$.

         Cosette  Thénardier  Valjean  ...  Cosette-Thénardier  Cosette-Valjean  ...
 u_101      1         1          0     ...          1                  0         ...
 u_102      1         1          0     ...          1                  0         ...
 u_103      1         1          0     ...          1                  0         ...
 u_104      1         0          1     ...          0                  1         ...

This data organization already gives an orientation to subsequent analyses and should be kept in mind by the practitioner. Textual units are now considered as individuals (in the statistical terminology), defined by the variables contained in the different columns of both tables. Moreover, subsequent analyses are oriented toward searching how the unit-entity table E influences the unit-word table N, i.e., searching which words are over-represented or under-represented considering the entities within a specific unit. While an author uses characters in order to build her/his narrative, we, to a certain extent, work backward: we are searching how character appearances and interactions in a textual unit act on her/his choice of words.

If the extraction method permits it, a practitioner should include all entities which she/he desires to study. Here, for example, the choice to include character pairs along with characters is motivated by the fact that we are interested in studying character relationships. A character pair can roughly be seen as an interaction between two characters, and this interaction should be considered as an object of its own: the presence of this interaction in a unit does not result in a mixture of the words used for each character, but rather gives a specific flavor to the unit.

This data organization also highlights the importance of choosing a proper size for the units. These units should be large enough to contain enough words to properly capture the textual specificity of each unit, but not too large, as each unit should ideally capture particularities about one of the entities. Unfortunately, it is impossible to define an ideal size for all types of analysis. This size should be balanced regarding the level of analysis, the text size, the selected entities, and previous knowledge of the studied work.

The use of a contingency table N to represent the textual resource present in the units denotes a Bag-of-Words approach. This approach loses the information relative to the order of words in the units, but makes it possible to transform a chain of characters, ill-suited to statistical analysis, into a contingency table, a well-studied mathematical object which allows the use of various kinds of computational methods. The next section shows a particular direction on how to use this table, with the help of embeddings.
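As an illustration of this data organization, the two tables can be assembled as follows. This is a minimal sketch with hypothetical inputs: the tokenized units and per-unit entity detections are assumed to be produced upstream (e.g., by lemmatization and NER), and all names are ours:

```python
import numpy as np

def build_tables(unit_tokens, unit_entities, vocabulary, entities):
    """Build the (n x v) unit-word table N and the (n x p) unit-entity table E."""
    word_index = {w: j for j, w in enumerate(vocabulary)}
    entity_index = {e: j for j, e in enumerate(entities)}
    n, v, p = len(unit_tokens), len(vocabulary), len(entities)
    N = np.zeros((n, v), dtype=int)
    E = np.zeros((n, p), dtype=int)
    for i, tokens in enumerate(unit_tokens):       # word counts per unit
        for t in tokens:
            if t in word_index:
                N[i, word_index[t]] += 1
    for i, present in enumerate(unit_entities):    # presence/absence version
        for e in present:
            E[i, entity_index[e]] = 1
    return N, E
```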
2.2. Embedding of textual units

Various methods can be applied to the contingency table N in order to extract information from it. Here, we choose to extract a lower-dimensional, numeric representation of each unit, in other words, a textual unit vector located in an embedding space. In section 2.3, these textual unit vectors are used as anchor points in order to also embed entities into the same space. Therefore, it is crucial that the directions or regions of this embedding space can be interpreted, in order to properly interpret the localization of entity vectors (the relative position of entity vectors among themselves is generally insufficient). For that reason, we focus on embeddings of textual units which also contain vectors of words: by examining the positions of entities relative to word vectors, entities can be depicted. We propose two embeddings satisfying this condition: section 2.2.1 describes Correspondence Analysis (CA) and section 2.2.2 focuses on Pre-trained Word Vectors (WV).

2.2.1. Correspondence Analysis (CA)

Using Correspondence Analysis (CA) to analyze textual resources has a long tradition [21]. It has the advantage of naturally providing an embedding space, the factorial map, where units are placed alongside word vectors, and it allows the interpretation of the placement of units in terms of word frequency profiles. Unit and word vectors in the embedding space have a direct interpretation in terms of the chi-square distance between profiles.

By performing a Correspondence Analysis on table N, we get $n$ vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ corresponding to units (rows) and $v$ vectors $\mathbf{w}_1, \dots, \mathbf{w}_v$ corresponding to words (columns). Each of these vectors has a size of $\min(n, v) - 1$, which will generally be $n - 1$. For a detailed computation of the quantities in CA, see Appendix A.1. An association score between a particular unit $i$ and a word $j$ is expressed through the scalar product between their vectors

$$a_{ij} := \mathbf{x}_i^\top \mathbf{w}_j. \qquad (1)$$

A positive (resp. negative) association score denotes an over-representation (resp. under-representation) of word $j$ in unit $i$, which makes it possible to find lists of words characterizing the different units. Note that in this article, this association score is rather computed between a word vector and an entity vector, since the latter, as we will see in section 2.3, lies in the same space as unit vectors. We could also track how units (or entities) are dissimilar to each other, this time using the Euclidean distance between vectors.

Note that the vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ obtained from CA reflect textual unit profiles (in terms of words) relative to the mean profile (the origin in the factorial map). This analysis is thus contrastive: it highlights unit variations within the studied text. This means that the particular tone of the whole studied text might be hidden in this analysis, and only the variation around this tone will be revealed. It might lead to a situation where the (absolute) feeling experienced by the reader does not appear in the analysis: e.g., a sad character in a sad book might appear joyful if he is less sad than the mean tone. This can become problematic when the method is used sequentially to study multiple works: particularities of each book will be hidden. Another limitation of this approach is that the words helping the interpretation of units (and entities) are contained in the studied text. Approaches requiring to study the position of units and entities relative to a predefined list of words (e.g., friends, enemies, family) might therefore be impossible if these words do not appear in the text.
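The CA computation itself is detailed in Appendix A.1; a compact numpy sketch of equations (7)-(9), written by us for illustration (not the paper's own code), could look as follows:

```python
import numpy as np

def correspondence_analysis(N):
    """Plain-numpy CA of a contingency table (rows: units, cols: words),
    following Appendix A.1: returns unit vectors X and word vectors W."""
    N = np.asarray(N, dtype=float)
    f = N.sum(axis=1) / N.sum()           # unit weights f_i
    g = N.sum(axis=0) / N.sum()           # word weights g_j
    Q = N / N.sum() / np.outer(f, g)      # independence quotients q_ij
    # Weighted scalar product matrix K between units (eq. 7)
    K = np.sqrt(np.outer(f, f)) * (((Q - 1) * g) @ (Q - 1).T)
    eigval, U = np.linalg.eigh(K)
    dims = min(N.shape) - 1               # number of non-trivial axes
    order = np.argsort(eigval)[::-1][:dims]
    lam = np.clip(eigval[order], 1e-12, None)  # guard tiny negative eigenvalues
    U = U[:, order]
    X = np.sqrt(lam) * U / np.sqrt(f)[:, None]     # unit vectors (eq. 8)
    W = (f[:, None] * Q).T @ X / np.sqrt(lam)      # word vectors (eq. 9)
    return X, W

# Association scores (eq. 1) between all units and words: A = X @ W.T
```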
2.2.2. Pre-trained Word Vectors (WV)

Pre-trained Word Vectors (WV), based on methods such as Word2Vec [22], GloVe [23], fastText [24], or BERT [25], have received great attention from various fields in the last decade. They are generally obtained through training on a very large corpus, such as Wikipedia or Common Crawl, and the resulting embedding contains a large quantity of word vectors. As shown by multiple studies (see [26] for a survey), these vectors are placed so as to reflect semantic and syntactic relationships between words, and are used in various applications. We focus here on static word embeddings, where word vectors are fixed and do not depend on their context, obtained by, e.g., fastText. The reason is that we need interpretable regions in an unchanging embedding space.

There exist multiple methods which use pre-trained word vectors in order to derive vectors for a group of words, such as sentences [17, 27], paragraphs [28], or documents [29]. These derived vectors are often used to apply a classification or clustering algorithm on the newly embedded objects, or to query information [27, 29]. In order to derive these vectors, the majority of methods use the frequencies of words found in the objects, i.e., a table similar to N, but apply various weighting schemes and normalizations in order to reduce the effects of frequent words and to standardize vectors. In the present article, we use the methodology proposed in [27], as it is compatible with multiple unit sizes and gives good results in many tasks. Thus, textual unit vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ are obtained from the table N with the method detailed in Appendix A.2. An association score can again be computed between a unit (or an entity) vector $\mathbf{x}_i$ and a word vector $\mathbf{w}_j$ through the cosine similarity, defined by

$$a_{ij} := \frac{\mathbf{x}_i^\top \mathbf{w}_j}{\sqrt{\mathbf{x}_i^\top \mathbf{x}_i}\,\sqrt{\mathbf{w}_j^\top \mathbf{w}_j}}. \qquad (2)$$

Note that, with word vectors, this cosine similarity also permits comparing units (or entities) between themselves.

With the pre-trained word vector method, the unit vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ (and the entity vectors in section 2.3) lie in an absolute space defined by the pre-trained word vectors. Comparison between different texts is therefore more pertinent, and associations with words absent from the corpus can be made. However, it is possible that all units from a given text will be located in the same region of the space if the vocabulary used in it is very specific. In this case, the list of most associated word vectors might be similar for every unit, and the analysis will not give satisfying results. This effect is fortunately limited by the centering of unit vectors which occurs in the method described in Appendix A.2.
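A sketch of the unit-vector construction of Appendix A.2 (the weighting scheme of Arora et al. [27]); this is our own illustrative implementation, assuming the pre-trained vectors are already restricted to the corpus vocabulary, in the same word order as the columns of N:

```python
import numpy as np

def sif_unit_vectors(N, word_vectors, a=0.01):
    """Unit vectors from pre-trained word vectors, with the weighting of
    Arora et al. (2017) as in Appendix A.2. N is the (n x v) count table,
    word_vectors the (v x d) matrix of pre-trained vectors."""
    N = np.asarray(N, dtype=float)
    p_w = N.sum(axis=0) / N.sum()                  # corpus frequencies n_.j / n_..
    weights = a / (a + p_w)                        # down-weight frequent words
    X_tilde = (N / N.sum(axis=1, keepdims=True) * weights) @ word_vectors  # eq. 10
    # Remove the projection on the first singular direction (eq. 11)
    u = np.linalg.svd(X_tilde, full_matrices=False)[2][0]
    return X_tilde - np.outer(X_tilde @ u, u)
```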
2.3. Entity embeddings

The main goal of this article is not to analyze units, but rather entities, i.e., the $p$ columns of table E. While we use the table N to build embeddings of units, we utilize the table E in order to build the entity vectors $\mathbf{y}_1, \dots, \mathbf{y}_p$ relative to the unit vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$. Two methods are proposed: the centroids method (CENT), described in section 2.3.1, and the regressions method (REG), explained in section 2.3.2. Both methods can be combined with either of the unit embeddings defined in the previous section.

2.3.1. Centroids (CENT)

This method is the most trivial and is based on the following intuition: an entity is characterized equally by all units in which it appears. In other words, we can define the vector $\mathbf{y}_k$ for entity $k$ as

$$\mathbf{y}_k = \sum_{i=1}^{n} f_i\, e_{ik}\, \mathbf{x}_i, \qquad (3)$$

where $f_i = n_{i\bullet}/n_{\bullet\bullet}$ is the relative weight of unit $i$. $\mathbf{y}_k$ indicates the center of mass, or centroid, of the units containing the entity. This way of building entity vectors is closely related to the treatment of supplementary variables found in CA: these variables do not act in the choice of factorial axes, but can still be represented afterward. By contrast, however, entity vectors are not dilated after computing centroids, which means that they lie in the same space as the unit (row) vectors. An important remark about the centroid method is that entity vector positions are additive, i.e., we have

$$e_{ik} = \sum_{g\in\mathcal{G}} e_{ig},\ \forall i \implies \mathbf{y}_k = \sum_{g\in\mathcal{G}} \mathbf{y}_g, \qquad (4)$$

where $\mathcal{G}$ is a subset of entities. This property can be interpreted as follows: if a character $k$ can be divided among different situations $g$ (the character alone, the character in interaction with another character, etc.), the character vector $\mathbf{y}_k$ is in fact the sum of all vectors $\mathbf{y}_g$ of these situations. This is not necessarily an undesirable property, but it implies that the specificities of the lone character might be hidden if he is often registered in an interaction. By contrast, if we consider that an interaction between two characters is an emerging situation, unrelated to the prior behaviors of the characters, the regressions method described in the next section seems more appropriate.

2.3.2. Regressions (REG)

When building a regression model with multiple explanatory variables, it is possible to also include their interactions. By doing so, we suppose that the effect of raising both variables is not the same as raising each variable independently. Regression models therefore seem appropriate to capture the specificities of having a particular entity in a textual unit. For example, in the case of character pairs, the presence of a character $a$ will have an effect on the vocabulary of a unit, the presence of another character $b$ will have another effect, and the presence of the pair $\{a, b\}$ yet a different effect.

Now, the dependent variables in the regression models still need to be defined. In fact, we are doing $d$ regressions, with $d$ the number of dimensions of the embedding, and each regression is constructed to predict the $\alpha$-th coordinate of units by using the binary variables in the table E. In matrix notation, all regression models can be written as

$$\mathbf{X} = \tilde{\mathbf{E}}\mathbf{B} + \boldsymbol{\Sigma}, \qquad (5)$$

where $\mathbf{X} = (x_{i\alpha})$ is the $(n \times d)$ matrix containing unit vectors (on rows), $\tilde{\mathbf{E}}$ is the matrix E with an additional first column of ones for the intercept, $\mathbf{B} = (\beta_{k\alpha})$ is the $((p + 1) \times d)$ matrix containing intercepts and regression coefficients (each column corresponds to one regression), and $\boldsymbol{\Sigma}$ is the $(n \times d)$ matrix containing normal errors. The estimated intercepts and coefficients $\hat{\mathbf{B}} = (\hat{\beta}_{k\alpha})$ can be considered as our embeddings for the entities as well as for the intercept, which represents the general tone of the studied text. We therefore denote these estimates by $\mathbf{Y} = (y_{k\alpha})$ in the following, with the notation convention $y_{0\alpha}$ for intercept coordinate $\alpha$.

As the number of entities (i.e., predictors) might be very large, it is a good idea to add an L2 regularization term to the objective function. Moreover, the quadratic errors should also be weighted by the number of tokens in each unit. Including all this, we find the solution for our intercept and entity vectors $\mathbf{y}_0, \mathbf{y}_1, \dots, \mathbf{y}_p$, contained in the rows of $\mathbf{Y}$, with

$$\mathbf{Y} = \left(\tilde{\mathbf{E}}^\top \mathrm{Diag}(\mathbf{f})\,\tilde{\mathbf{E}} + \lambda \mathbf{I}_{(p+1)}\right)^{-1} \tilde{\mathbf{E}}^\top \mathrm{Diag}(\mathbf{f})\,\mathbf{X}, \qquad (6)$$

where $\mathrm{Diag}(\mathbf{f})$ is the diagonal matrix containing the unit weights $\mathbf{f} = (f_i)$, $\lambda > 0$ is the regularization coefficient, and $\mathbf{I}_{(p+1)}$ is the identity matrix of size $((p + 1) \times (p + 1))$. An interesting effect of the regularization coefficient is that if $\lambda$ is high, equation (6) becomes $\mathbf{Y} \approx \frac{1}{\lambda}\tilde{\mathbf{E}}^\top \mathrm{Diag}(\mathbf{f})\mathbf{X}$, which is similar to equation (3) up to a contraction factor $\lambda$. In fact, the regressions method with a regularization term interpolates between the hypothesis that every entity should be considered independently (when $\lambda \to 0$) and the hypothesis of an additive mixture between entities (when $\lambda \to \infty$), as discussed in section 2.3.1. Choosing an appropriate $\lambda$ according to the study (how to do so is another, difficult question) might lead to a situation revealing desirable information about entities.
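Both entity-embedding methods reduce to a few matrix operations. A minimal numpy sketch (ours, with illustrative names) of equations (3) and (6):

```python
import numpy as np

def entity_vectors_cent(X, E, f):
    """Centroid entity vectors (eq. 3): Y = E^T Diag(f) X."""
    return E.T @ (f[:, None] * X)

def entity_vectors_reg(X, E, f, lam=0.01):
    """Ridge-regression entity vectors (eq. 6); the first row of the
    result is the intercept, the remaining rows the entity vectors."""
    n = E.shape[0]
    E_tilde = np.hstack([np.ones((n, 1)), E])      # add intercept column
    G = E_tilde.T @ (f[:, None] * E_tilde) + lam * np.eye(E.shape[1] + 1)
    return np.linalg.solve(G, E_tilde.T @ (f[:, None] * X))
```

Setting a large `lam` reproduces the contracted-centroid behavior discussed above, while a small `lam` yields the more "perpendicular" entity descriptions.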
3. Case study: Les Misérables

At the time of writing, it is not possible to evaluate the exposed framework with a quantitative metric which would allow testing its pertinence on various corpora. In order to see if the methods give coherent results, we have to carefully scrutinize them and compare them with previous knowledge of the studied work. For this reason, and because of the method variations and the multiplicity of the results (and lack of space), we chose to present only one case study: the analysis of characters and relationships in Les Misérables, by Victor Hugo. The choice of this work is motivated by the fact that it is a large corpus, well known, immensely studied, and containing various colorful characters and character relationships. Therefore, it is a strong choice to clearly illustrate the potential of the exposed framework.

3.1. Preprocessing

The five volumes of Les Misérables, in French, were extracted from Project Gutenberg (https://www.gutenberg.org/), and the headers and footers of each file were manually removed. The whole text was lowercased and lemmatized, and stopwords (from a list made by Jacques Savoy, http://members.unine.ch/jacques.savoy/clef/frenchST.txt) and punctuation were removed. Volume, book, and chapter breakpoints were kept for later use. We chose to use chapters as textual units. The table N (see section 2.1) was built by considering words appearing at least 20 times in the text, and resulted in a table of size 365 chapters × 1974 words.

Characters were detected using Flair NER tools (https://github.com/flairNLP/flair) [30]. In order to unify characters and to further refine the results, we used hand-made lists of character names and aliases built from the NER results. This resulted in the detection of 54 characters. The entities considered in table E (see section 2.1) are composed of the 54 single characters and 547 character pairs, resulting in a table of size 365 × 601. A character (resp. a pair of characters) is considered present if it is (resp. both are) detected at least 2 times in the chapter.

Note that, in section 3.3.3, we also tested experiments with entities consisting of characters and character pairs as found in each volume (e.g., Cosette-Valjean in volume one and Cosette-Valjean in volume two are now two different entities), with the addition of volume constants ($V_i = 1$ in volume $i$ and $V_i = 0$ in other volumes) in order to isolate volume-specific vocabulary. This new table Evol, containing 1124 entities, makes it possible to track the diachronic evolution of the words associated with volumes, characters, and character relationships.

3.2. Methods

There are two types of methods for unit embeddings, CA (section 2.2.1) and WV (section 2.2.2), as well as two methods to derive entity embeddings from them, CENT (section 2.3.1) and REG (section 2.3.2), making a total of 4 possible ways of obtaining entity embeddings. The CA method does not need any external data and results in vectors in a 364-dimensional space, while the WV method is based on pre-trained fastText word vectors [24] trained on Common Crawl (https://fasttext.cc/docs/en/crawl-vectors.html). For French, the number of word vectors is around two million and the dimension of the vector space is 300.

Note that, in addition to having two tables E and Evol, four methods, and a considerable number of words and entities, results can also be presented in various ways (similarities between entities, associations between entities and words, etc.). Thus, we chose to show here a selection of results for each method: the 5 most associated words for a subset of entities (section 3.3.1), the 5 most associated entities for a subset of words (section 3.3.2), and a diachronic study of the 5 most associated words for a subset of entities (section 3.3.3).
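The result tables of the next sections are produced by ranking association scores. A sketch of the two query directions (illustrative code, ours; it assumes CA-style dot-product scores, and for WV the vectors would be L2-normalized first so that the product gives the cosine of equation (2)):

```python
import numpy as np

def top_words_for_entity(y, W, words, k=5):
    """Rank the vocabulary by association score with entity vector y."""
    scores = W @ y
    order = np.argsort(scores)[::-1][:k]
    return [(words[j], float(scores[j])) for j in order]

def top_entities_for_word(w, Y, entity_names, k=5):
    """Transposed query: rank entities by association with word vector w."""
    scores = Y @ w
    order = np.argsort(scores)[::-1][:k]
    return [(entity_names[j], float(scores[j])) for j in order]
```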
We invite curious readers to consult the results for all words and entities, which can be found in the "results" folder of our GitHub repository (https://github.com/gguex/char2char_vectors).

3.3. Results

3.3.1. The most associated words for a subset of entities

The first result in this section presents the most associated words for a subset of entities, as measured by the association score defined in section 2.2. Results can be found in Table 3 for all methods.

Table 3: The 5 most associated words (association score in parentheses) for a selected set of entities, regarding the CA-CENT, CA-REG, WV-CENT, and WV-REG methods.

CA-CENT
- Cosette: poupée (0.7), noce (0.68), mestienne (0.58), mariage (0.48), marié (0.48)
- Cosette-Marius: noce (1.72), mariage (1.31), marié (1.21), marier (1.11), baron (1.0)
- Cosette-Valjean: noce (1.0), mestienne (0.97), mariage (0.71), marié (0.68), corbillard (0.65)
- Marius: théodule (0.61), jondrette (0.59), ursule (0.56), vernon (0.53), tante (0.52)
- Valjean: mestienne (0.51), fossoyeur (0.46), accusé (0.45), maire (0.39), jean (0.37)
- Marius-Valjean: noce (1.2), mariage (0.85), ursule (0.85), marié (0.8), tableau (0.74)
- Javert: accusé (1.47), arras (1.04), mouchard (0.97), avocat (0.96), preuve (0.93)
- Javert-Valjean: accusé (1.85), avocat (1.12), preuve (1.1), président (1.08), forçat (1.01)
- Myriel: conventionnel (5.03), évêque (3.54), oratoire (3.39), hôpital (2.57), cathédrale (2.54)
- Myriel-Valjean: chandelier (6.28), gendarme (5.06), panier (4.72), couvert (4.64), deuil (4.52)

CA-REG
- Cosette: seau (1.23), poupée (0.86), ravissant (0.7), source (0.65), rassurer (0.61)
- Cosette-Marius: amant (0.83), mariage (0.73), entraîner (0.7), noce (0.67), volupté (0.63)
- Cosette-Valjean: blessure (0.78), noce (0.76), file (0.6), corbillard (0.58), mestienne (0.58)
- Marius: jondrette (1.76), réchaud (1.26), galetas (1.11), bouge (1.05), tableau (0.93)
- Valjean: matelas (1.02), chandelier (0.87), toulon (0.82), fossoyeur (0.79), pelle (0.76)
- Marius-Valjean: égout (1.1), vase (1.08), issue (1.07), sable (1.0), couloir (0.98)
- Javert: arras (1.09), roue (0.89), bonjour (0.83), malle (0.8), cabriolet (0.76)
- Javert-Valjean: accusé (1.04), nier (0.79), quai (0.54), avocat (0.53), fonction (0.5)
- Myriel: conventionnel (2.99), évêque (1.76), cathédrale (1.14), prêtre (1.11), philosophie (1.06)
- Myriel-Valjean: deuil (1.14), chandelier (1.07), aveugle (1.01), panier (0.94), gendarme (0.89)

WV-CENT
- Cosette: jean (0.34), dormir (0.28), regarder (0.26), habiller (0.26), voir (0.25)
- Cosette-Marius: aimer (0.38), rêver (0.34), vouloir (0.32), douter (0.32), avouer (0.32)
- Cosette-Valjean: jean (0.6), jacques (0.3), philippe (0.26), habiller (0.26), pantalon (0.25)
- Marius: embrasser (0.36), essayer (0.36), avouer (0.36), vouloir (0.35), voir (0.35)
- Valjean: jean (0.56), habiller (0.27), poser (0.26), jacques (0.26), pantalon (0.25)
- Marius-Valjean: jean (0.35), questionner (0.31), essayer (0.31), oser (0.31), poser (0.29)
- Javert: saisir (0.34), jean (0.34), placer (0.31), retirer (0.29), dégager (0.29)
- Javert-Valjean: jean (0.54), denis (0.31), jacques (0.3), saisir (0.3), philippe (0.28)
- Myriel: évêque (0.59), archevêque (0.52), prêtre (0.45), abbé (0.39), souverain (0.38)
- Myriel-Valjean: évêque (0.55), archevêque (0.46), prêtre (0.42), âme (0.42), abbé (0.39)

WV-REG
- Cosette: contempler (0.29), emplir (0.29), doucement (0.27), envelopper (0.26), illuminer (0.26)
- Cosette-Marius: éternel (0.35), amour (0.35), humanité (0.34), âme (0.32), vérité (0.32)
- Cosette-Valjean: rue (0.44), jean (0.41), faubourg (0.41), boulevard (0.41), quartier (0.34)
- Marius: regarder (0.38), voir (0.36), refermer (0.34), glisser (0.34), poser (0.31)
- Valjean: jean (0.56), pantalon (0.28), jacques (0.26), philippe (0.23), glisser (0.23)
- Marius-Valjean: rue (0.35), boulevard (0.35), souterrain (0.35), bastille (0.35), carrefour (0.34)
- Javert: serrer (0.34), glisser (0.34), forcer (0.34), bouger (0.33), aller (0.32)
- Javert-Valjean: rue (0.35), boulevard (0.34), autorité (0.33), civil (0.33), loi (0.33)
- Myriel: évêque (0.43), divin (0.4), humble (0.39), bonté (0.38), archevêque (0.37)
- Myriel-Valjean: ange (0.37), évêque (0.31), âme (0.31), amour (0.29), aurore (0.28)

We can observe that the CA methods seem to summarize entities with a vocabulary closer to the work, while the WV methods tend to use words with a wider scope, with notably more verbs. As a result, the WV methods give a general feeling for the tone used when describing characters and relationships, while the CA methods can depict very specific objects, locations, or events associated with these entities. This behavior can be understood by the nature of the unit embeddings: in the WV embedding, word vectors are fixed and do not take into account the actual frequencies of words found in the studied corpus. A character can be close to a word appearing only a few times (or not at all) in the corpus if this word is located near the vocabulary associated with this character, as semantically similar words are located in the same region of the space. By contrast, CA generally takes into account word frequencies along with specificities in order to describe an entity, and semantically similar words can be located far away from each other.

Another remark can be made about the difference between the CENT methods and the REG methods. As expected, we see that the CENT methods reveal their additive construction between characters and relationships: words used to describe a relationship rub off on its character descriptions (see, e.g., Cosette, Cosette-Marius, and Cosette-Valjean). By contrast, the REG methods display more "perpendicular" descriptions of entities, with fewer repeated words. Note that we did not show here the least associated words for each entity, as they are frequently the same for all methods and all related to the long description of the Battle of Waterloo in volume 2 ("infanterie", "wellington", "cuirassier", "brigade"), containing no protagonist of the story. Overall, we find that the CA-REG method provides the most satisfying results, with pertinent words associated with each entity and a high variety in the choice of words.

3.3.2. The most associated entities for a subset of words
These results are extracted from a transposed table and display the most associated entities for a selected set of words; they can be found in Table 4. This type of result can be seen as a query, made from a single word by a practitioner, which outputs the entities most associated with that query in the work. We chose here to show the top entities related to the words "aimer", "rue", "justice", and "guerre", as they represent some of the main topics of the book. In this task again, from our point of view, CA-REG displays the most accurate results: the main love relationship of the book (Cosette-Marius) is the most associated entity for "aimer", several "amis de l'ABC" (a revolutionary group) are most associated with "rue", the cop-suspect relationship (Javert-Valjean) is the top entity for "justice", and military officers or bellicose characters are associated with "guerre". While somewhat inferior on the selected set of queries, the WV methods have the advantage of being able to query words outside the scope of the book, as the pre-trained word embedding possesses a very large vocabulary.

Table 4: The 5 most associated entities (association score in parentheses) for a selected set of words, regarding the CA-CENT, CA-REG, WV-CENT, and WV-REG methods.

CA-CENT
- aimer: Dahlia-Fameuil (1.12), Dahlia-Listolier (1.12), Fameuil-Zéphine (1.12), Listolier-Zéphine (1.12), Gillenormand-Toussaint (1.11)
- rue: Courfeyrac-Fauchelevent (0.79), Courfeyrac-Toussaint (0.79), Eponine-Fauchelevent (0.79), Eponine-Gavroche (0.79), Eponine-Pontmercy (0.79)
- justice: Azelma-Babet (1.09), Azelma-Brujon (1.09), Azelma-Claquesous (1.09), Azelma-Magnon (1.09), Azelma-Montparnasse (1.09)
- guerre: Combeferre-Fauchelevent (1.35), Feuilly-Valjean (1.27), Feuilly-Marius (1.22), Lesgle-Valjean (1.15), Mabeuf-Valjean (1.15)

CA-REG
- aimer: Cosette-Marius (0.35), Myriel (0.31), Basque-Fauchelevent (0.25), Myriel-Valjean (0.22), Fauchelevent-Gillenormand (0.21)
- rue: Courfeyrac (0.37), Grantaire-Prouvaire (0.29), Cosette-Javert (0.26), Marius-Prouvaire (0.22), Enjolras (0.22)
- justice: Javert-Valjean (0.37), Champmathieu-Valjean (0.37), Myriel (0.21), Grantaire (0.18), Grantaire-Javert (0.16)
- guerre: Grantaire-Pontmercy (0.94), Grantaire (0.56), Marius-Pontmercy (0.55), Pontmercy (0.55), Enjolras (0.49)

WV-CENT
- aimer: Cosette-Marius (0.38), Fantine-Marius (0.34), Fantine-Pontmercy (0.34), Basque-Fauchelevent (0.34), Prouvaire-Valjean (0.33)
- rue: Grantaire-Prouvaire (0.57), Marius-Prouvaire (0.54), Cosette-Javert (0.43), Magnon-Monsieur Thénardier (0.37), Gavroche (0.36)
- justice: Azelma-Babet (0.34), Azelma-Brujon (0.34), Azelma-Claquesous (0.34), Azelma-Magnon (0.34), Azelma-Montparnasse (0.34)
- guerre: Grantaire (0.32), Combeferre-Lesgle (0.32), Feuilly-Lesgle (0.25), Combeferre-Marius (0.25), Combeferre-Grantaire (0.25)

WV-REG
- aimer: Prouvaire-Valjean (0.34), Champmathieu-Chenildieu (0.31), Brevet-Chenildieu (0.31), Brevet-Cochepaille (0.31), Champmathieu-Cochepaille (0.31)
- rue: Grantaire-Prouvaire (0.8), Marius-Prouvaire (0.77), Courfeyrac (0.66), Cosette-Javert (0.64), Prouvaire (0.63)
- justice: Champmathieu-Valjean (0.38), Azelma-Brujon (0.35), Azelma-Claquesous (0.35), Azelma-Magnon (0.35), Azelma-Montparnasse (0.35)
- guerre: Grantaire-Pontmercy (0.39), Enjolras-Marius (0.35), Grantaire (0.31), Combeferre-Lesgle (0.31), Cosette-Gavroche (0.27)

Note that another way to display these results is through weighted signed networks, as shown in Figure 1 (for CA-REG). The network structure represents the number of times characters are detected together (which does not depend on the query), and the signed weights (edge color) display the association score between a character relationship (edge) and the queried word. This representation gives a quick visual support for exploring the studied work and could be implemented as a standalone program.

[Figure 1: Resulting weighted and signed networks between main characters, with examples of word queries ("aimer", "rue", "justice", and "guerre"). These networks are computed with the CA-REG method (λ = 0.01). Red indicates positive affinity, blue negative affinity, and edge width is proportional to the number of detected interactions between characters.]
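This signed-network view can be sketched with networkx; the following illustrative snippet (ours) assumes the pair co-occurrence counts and per-query association scores have already been computed:

```python
import networkx as nx

def signed_network(pair_counts, pair_scores):
    """Character network: edge width from co-occurrence counts,
    edge sign/color from the association score of the pair with a query word."""
    G = nx.Graph()
    for (a, b), count in pair_counts.items():
        score = pair_scores.get((a, b), 0.0)
        G.add_edge(a, b, weight=count, score=score,
                   color="red" if score > 0 else "blue")
    return G

# Drawing (illustrative):
# pos = nx.spring_layout(G)
# nx.draw(G, pos, with_labels=True,
#         width=[G[u][v]["weight"] for u, v in G.edges()],
#         edge_color=[G[u][v]["color"] for u, v in G.edges()])
```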
3.3.3. A diachronic study of the most associated words for a subset of entities

These results are obtained from the table Evol, where entities are considered different depending on the volume. This makes it possible to track the evolution of association scores along the book. In addition to entities, we also define a constant term $V_i$ for each volume $i$, which absorbs the words associated with each volume. Results for the constants and a subset of entities (Valjean, Cosette, Cosette-Valjean) can be found in Table 5. Note that we did not show the CENT results in this table, as they are similar to those found in Table 3: words are often repeated for different entities, and results are less convincing.

Table 5: The 5 most associated words (association score in parentheses) for the volume constants, Valjean, Cosette, and Cosette-Valjean, as found in each volume, regarding the CA-REG and WV-REG methods (λ = 0.01). A dash indicates that the entity was not detected in the volume.

CA-REG
- V1: huissier (1.4), hôte (1.11), arras (0.92), lampe (0.91), montreuil (0.81)
- V2: cuirassier (3.38), infanterie (2.9), sacrement (2.69), brigade (2.41), division (2.4)
- V3: gamin (1.92), mine (1.38), farce (0.81), ignorance (0.74), jondrette (0.7)
- V4: émeute (1.48), révolte (0.86), bourgeoisie (0.84), populaire (0.82), insurrection (0.81)
- V5: sable (3.04), berge (2.32), égout (2.16), voûte (1.9), vase (1.89)
- Valjean 1: chandelier (1.03), toulon (0.8), gervai (0.73), bagne (0.65), maire (0.57)
- Valjean 2: pelle (2.6), fossoyeur (2.35), pioche (1.51), carte (1.39), mestienne (1.35)
- Valjean 3: ursule (1.49), luxembourg (1.06), tableau (0.93), banc (0.81), mouchoir (0.77)
- Valjean 4: réverbère (0.98), hausser (0.45), promenade (0.37), lanterne (0.36), tuyau (0.35)
- Valjean 5: matelas (2.54), ronde (1.96), galerie (1.06), lanterne (0.99), rive (0.99)
- Cosette 1: gargote (0.59), balayer (0.57), alouette (0.56), servante (0.43), mois (0.34)
- Cosette 2: seau (1.48), poupée (0.99), source (0.84), gargote (0.71), mestienne (0.64)
- Cosette 3: -
- Cosette 4: ravissant (1.11), céleste (0.76), volupté (0.67), frémir (0.64), lancier (0.6)
- Cosette 5: encre (0.76), plume (0.59), noce (0.49), chandelier (0.48), antichambre (0.47)
- Cosette-Valjean 1: maladie (0.49), médecin (0.48), demain (0.33), surprise (0.28), auprès (0.28)
- Cosette-Valjean 2: façade (0.68), corbillard (0.66), mestienne (0.6), bâtiment (0.56), cul (0.55)
- Cosette-Valjean 3: -
- Cosette-Valjean 4: promenade (0.5), chaîne (0.47), blessure (0.46), tuyau (0.45), luxembourg (0.44)
- Cosette-Valjean 5: noce (1.27), marié (0.93), mardi (0.89), mariage (0.86), file (0.65)

WV-REG
- V1: demander (0.36), décider (0.3), aider (0.3), expliquer (0.29), plaindre (0.29)
- V2: saint (0.39), mont (0.39), régiment (0.38), chapelle (0.38), infanterie (0.36)
- V3: gamin (0.45), garçon (0.42), jeune (0.36), enfant (0.35), père (0.34)
- V4: violence (0.44), haine (0.42), révolte (0.42), souffrance (0.4), étincelle (0.39)
- V5: égout (0.52), quai (0.45), rue (0.44), eau (0.42), chaussée (0.42)
- Valjean 1: essayer (0.29), réfléchir (0.27), expliquer (0.24), agir (0.24), questionner (0.24)
- Valjean 2: jean (0.54), jacques (0.33), pantalon (0.29), mr (0.28), denis (0.28)
- Valjean 3: admirer (0.32), passer (0.31), observer (0.3), guetter (0.3), croiser (0.3)
- Valjean 4: jean (0.71), jacques (0.41), pantalon (0.36), louis (0.34), philippe (0.33)
- Valjean 5: jean (0.72), pantalon (0.42), jacques (0.39), philippe (0.34), denis (0.33)
- Cosette 1: an (0.41), mois (0.39), mère (0.36), fille (0.35), enfant (0.33)
- Cosette 2: dormir (0.33), regarder (0.29), sentir (0.28), endormir (0.28), respirer (0.28)
- Cosette 3: -
- Cosette 4: rêver (0.32), regarder (0.31), contempler (0.28), pleurer (0.27), lire (0.27)
- Cosette 5: rêver (0.31), mentir (0.29), écrire (0.29), demander (0.28), pleurer (0.28)
- Cosette-Valjean 1: voir (0.32), entendre (0.31), frissonner (0.3), grommeler (0.3), essayer (0.3)
- Cosette-Valjean 2: rue (0.55), ruelle (0.48), boulevard (0.45), mur (0.4), faubourg (0.4)
- Cosette-Valjean 3: -
- Cosette-Valjean 4: jean (0.52), pantalon (0.35), gilet (0.28), gris (0.27), manteau (0.26)
- Cosette-Valjean 5: mariage (0.42), marié (0.4), noce (0.4), gai (0.33), amour (0.3)

Here again, we see that the words associated by the WV method give the general tone of volumes and entities, while the CA results are more specific and related to particular events which occurred to the characters. As expected, the words associated with the volume constants give a short overview of each volume, especially with the CA-REG method (e.g., V2 for the Battle of Waterloo, V4 for the barricade event). The words associated with entities also seem accurate in describing them. Note that Cosette was not detected in volume 3 because she is not explicitly cited there (she is often referred to as "the daughter of M. Leblanc"), and this also explains the absence of the Cosette-Valjean pair.

4. Conclusion

In this article, we introduced a general framework for automatically extracting textual information about narrative entities from a small corpus or a single work. The framework is built on two tables, the unit-word table N and the unit-entity table E. This data organization sets subsequent analyses in a classical statistical framework, where the goal is to see how the variables in E (the entities) affect the variables in N (the vocabulary) for each textual unit. A choice was made to use embeddings for analyzing these effects: units and words are embedded using Correspondence Analysis or pre-trained Word Embeddings on N, and entities are embedded in the same space as units using the Centroids or the Regressions method on E. These embeddings are then used to measure affinities between entities and words, enabling the characterization of the former by the latter. A case study on Les Misérables was performed to see if the methods gave promising results.

The first important choice in the analysis is how to define the size of the units. Other corpora were also tested (e.g., Shakespeare plays), and it seems important to define units of at least a paragraph in size (after preprocessing) in order to represent them accurately. Choosing small units might successfully capture word specificities related to a small subset of entities, but unit vectors become almost orthogonal to one another if the size of units is too small. This situation results in an overfitting regime with a high variance and low bias, i.e., unit positions can become distant due to differences in only a few rare words. By contrast, large units will result in an underfitting regime, with a low variance and high bias, failing to capture entity specificities, but more robust to particularities in word usage.
Having enough units is also important in order to properly locate entities in the embedding space. For analyzing characters and relationships, we advise practitioners to use their prior knowledge of the work in order to split the studied narrative as closely as possible into "scenes" (as found in theater), each describing a particular event between an almost constant set of characters.

The second choice is which entities to study. This choice is of course driven by the research question, but is also limited by the automatic extraction tools available. These entities can be various, but they must appear frequently in the work in order to be placed correctly in the embedding space. However, it is inadvisable to include an entity which is almost always present (e.g., a narrator), as it will already be represented by the origin in the CENT method or by the constant term in the REG method. As a rule of thumb, the number of entities should ideally be lower than the number of textual units. However, even with an exceeding number of entities (as in our case study, where we had 601 and 1124 entities for 365 units), analyses are still possible if some entities appear rarely. Note that the version of the table E containing counts of entities rather than presences in each unit was also tested in experiments, but gave similar results for the studied corpus.

The choice of using embeddings, where units, entities, and words are located, is motivated by the fact that the resulting space permits many types of exploration. As presented in this article, we can extract some of the most (or least) associated words for each entity, or rank entities according to a word query, but other types of measurements could also be made. Entities could be placed along a particular axis in the space, defined by two sets of contrasted words, in order to highlight a particular opposition (positive-negative, in order to do sentiment analysis, introvert-extrovert, friend-enemy, etc.), as sketched below. This approach could also be combined with a clustering of the words, or a Topic Modeling method, thus permitting to further refine the different regions of the embedding space. The relative locations of entities could also be used to cluster or classify them. All these leads can be explored in future research.
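As an example of the axis idea mentioned above, here is an illustrative sketch (ours; the word lists and names are hypothetical) projecting entity vectors onto a contrast axis defined by two word sets:

```python
import numpy as np

def contrast_axis_scores(Y, W, word_index, positive, negative):
    """Project entity vectors Y on an axis defined by two contrasted word
    sets (e.g. friend-like vs. enemy-like words); positive scores lean
    toward the first set. Assumes at least one word of each list is in
    the vocabulary; absent words are skipped."""
    pos = np.mean([W[word_index[w]] for w in positive if w in word_index], axis=0)
    neg = np.mean([W[word_index[w]] for w in negative if w in word_index], axis=0)
    axis = pos - neg
    axis /= np.linalg.norm(axis)
    return Y @ axis
```

With the WV embedding, the two word lists need not appear in the studied text, which is precisely the property discussed next.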
The difference between the CA and WV embeddings appears quite clearly in the results. CA highlights particular words associated with entities, very specific to the studied work and the narrative events found in it, while WV gives a general feeling of the tone of the text when these entities are present. This difference is explained by the fact that CA focuses on words appearing within the work, with possibly very distant locations for semantically similar words, while WV word vectors are positioned according to their semantic and syntactic similarities. An entity located in the WV space will thus be in a semantic or syntactic region, and its characterizing words should all be related. Results show that CA methods generally perform better for quickly interpreting entities within the narrative, but might be limited for some applications. As a matter of fact, the advantage of the WV embedding is that its space is absolute, permitting the comparison of results between sequentially studied texts, and it also contains a larger vocabulary. This last property could be exploited to use a fixed list of relationship attributes (e.g., friend, enemy, family, colleague), which do not necessarily appear in every text of the studied corpus, in order to categorize character relationships.

The choice between the CENT method and the REG method is relatively easy: thanks to its hyperparameter λ, the REG method can give results similar to the CENT method when λ is high (with only a contraction of entity vectors), but also gives more "perpendicular" sets of words describing entities when λ is low. Thus, it is clearly a superior choice for producing a variety of results. The choice of this hyperparameter λ depends on what the practitioner desires. If her/his entities are defined such that some of them are completely included in others, such as a character and a character pair, and she/he would like to obtain specificities about the finer-grained entity, λ must be set to a low value. By contrast, if she/he does not mind having some entities described as a mixture of others, she/he can set λ to a high value. However, very low values of λ should be avoided if the number of entities is high compared to the number of units, as this will lead to an overfitting of the regression coefficients and result in the association of very rare and specific words with entities.

Finally, the biggest remaining weakness of this framework is the difficulty of validating its pertinence. Several other case studies, with results carefully scrutinized against prior knowledge, should be undertaken in order to see if results are trustworthy, but this type of experiment is expensive both in time and human resources. Another idea could be to use annotated corpora such as the one described in [31], where human annotators classified character relationships into various categories. For example, we could see if the presented method can actually retrieve these categories by assigning relationships to the category-word with the highest association score. Such experiments are promising, but they require an efficient automatic entity tagger, in order to detect and especially unify characters in a large quantity of documents, and unfortunately, this tool does not exist yet. Nevertheless, these first case studies gave promising results for this framework, and its flexibility could lead to various applications.

A. Appendix

A.1. Correspondence Analysis

Starting from the $(n \times v)$ contingency table $\mathbf{N} = (n_{ij})$, we define the vector of unit weights as $\mathbf{f} = (f_i) := (n_{i\bullet}/n_{\bullet\bullet})$ and the vector of word weights as $\mathbf{g} = (g_j) := (n_{\bullet j}/n_{\bullet\bullet})$, where $\bullet$ denotes summation over the replaced index. It is then possible to compute the weighted scalar product matrix between units $\mathbf{K} = (k_{ij})$ with

$$k_{ij} := \sqrt{f_i f_j} \sum_{k=1}^{v} g_k (q_{ik} - 1)(q_{jk} - 1), \qquad (7)$$

where $q_{ik} = \frac{n_{ik}\, n_{\bullet\bullet}}{n_{i\bullet}\, n_{\bullet k}}$ is the quotient of independence of the cell $(i, k)$. The vector of textual unit $i$, $\mathbf{x}_i = (x_{i\alpha})$, is obtained from the eigendecomposition of the matrix $\mathbf{K} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$ with

$$x_{i\alpha} := \frac{\sqrt{\lambda_\alpha}}{\sqrt{f_i}}\, u_{i\alpha}, \qquad (8)$$

where the $\lambda_\alpha$ are the eigenvalues contained in the diagonal matrix $\boldsymbol{\Lambda}$ and the $u_{i\alpha}$ are the eigenvector components found in $\mathbf{U}$. We find the vector of word $j$, $\mathbf{w}_j = (w_{j\alpha})$, with

$$w_{j\alpha} := \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} f_i\, q_{ij}\, x_{i\alpha}. \qquad (9)$$
Note that various other quantities of interest can also be computed in CA, such as:

- $p_\alpha := \lambda_\alpha / \lambda_\bullet$ : the proportion of inertia expressed by axis $\alpha$,
- $cu_{i\alpha} := f_i x_{i\alpha}^2 / \lambda_\alpha$ : the contribution of unit $i$ to axis $\alpha$,
- $cw_{j\alpha} := g_j w_{j\alpha}^2 / \lambda_\alpha$ : the contribution of word $j$ to axis $\alpha$,
- $hu_{i\alpha} := x_{i\alpha}^2 / \sum_\alpha x_{i\alpha}^2$ : the contribution of axis $\alpha$ to unit $i$,
- $hw_{j\alpha} := w_{j\alpha}^2 / \sum_\alpha w_{j\alpha}^2$ : the contribution of axis $\alpha$ to word $j$.

For a detailed interpretation of these different quantities, see [21].

A.2. Unit embedding based on pre-trained word vectors

This method is justified and detailed in [27]. Let $\mathbf{w}_1, \dots, \mathbf{w}_v$ be the pre-trained word vectors which appear in the studied corpus, and $\mathbf{N}$ the $(n \times v)$ table counting the frequency of these words in the $n$ textual units. We first construct the uncentered vector $\tilde{\mathbf{x}}_i$ of each unit $i$ with

$$\tilde{\mathbf{x}}_i = \sum_{j=1}^{v} \frac{n_{ij}}{n_{i\bullet}} \cdot \frac{a}{a + \frac{n_{\bullet j}}{n_{\bullet\bullet}}}\, \mathbf{w}_j, \qquad (10)$$

where $a > 0$ is a hyperparameter which gives less importance to frequent words as $a \to 0$. In this article, we set $a$ to the recommended value of 0.01. Let $\tilde{\mathbf{X}}$ be the matrix whose columns are the vectors $\tilde{\mathbf{x}}_i$, and $\mathbf{u}$ its first singular vector. We compute the vector $\mathbf{x}_i$ of each unit $i$ with

$$\mathbf{x}_i = \tilde{\mathbf{x}}_i - \mathbf{u}\mathbf{u}^\top \tilde{\mathbf{x}}_i. \qquad (11)$$

This last equation acts as a centering of the unit vectors in the direction of the first singular vector $\mathbf{u}$.

References

[1] F. Moretti, "Operationalizing": or, the Function of Measurement in Modern Literary Theory, The Journal of English Language and Literature 60 (2014) 3–19.
[2] T. Underwood, A Genealogy of Distant Reading, DHQ: Digital Humanities Quarterly 11 (2017).
[3] M. P. Eve, Close Reading with Computers: Genre Signals, Parts of Speech, and David Mitchell's Cloud Atlas, SubStance 46 (2017) 76–104.
[4] W. Schmid, Narratology: an introduction, Walter de Gruyter, 2010.
[5] A. Agarwal, A. Kotalwar, J. Zheng, O. Rambow, SINNET: Social Interaction Network Extractor from Text, in: The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations, Asian Federation of Natural Language Processing, Nagoya, Japan, 2013, pp. 33–36. URL: https://aclanthology.org/I13-2009.
[6] S. Chaturvedi, M. Iyyer, H. Daume III, Unsupervised learning of evolving relationships between literary characters, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[7] J. Li, A. Sun, J. Han, C. Li, A Survey on Deep Learning for Named Entity Recognition, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 50–70. doi:10.1109/tkde.2020.2981314.
[8] V. Labatut, X. Bost, Extraction and Analysis of Fictional Character Networks: A Survey, ACM Computing Surveys 52 (2019) 89:1–89:40. URL: https://doi.org/10.1145/3344548. doi:10.1145/3344548.
[9] S. Min, J. Park, Modeling narrative structure and dynamics with networks, sentiment analysis, and topic modeling, PLOS ONE 14 (2019) e0226025. doi:10.1371/journal.pone.0226025.
[10] I. Novakova, D. Siepmann, Literary Style, Corpus Stylistic, and Lexico-Grammatical Narrative Patterns: Toward the Concept of Literary Motifs, in: Phraseology and Style in Subgenres of the Novel, Springer International Publishing, 2019, pp. 1–15. doi:10.1007/978-3-030-23744-8_1.
[11] S. Grayson, M. Mulvany, K. Wade, G. Meaney, D. Greene, Novel2vec: Characterising 19th century fiction via word embeddings, in: 24th Irish Conference on Artificial Intelligence and Cognitive Science, 2016.
[12] R. J. Heuser, Word vectors in the eighteenth century, in: ADHO 2017-Montréal, 2017.
[13] S. J. Kerr, When Computer Science Met Austen and Edgeworth, NPPSH Reflections 1 (2017) 38–52.
[14] M. Elsner, Character-based kernels for novelistic plot structure, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Avignon, France, 2012, pp. 634–644. URL: https://aclanthology.org/E12-1065.
[15] J. Lee, C. Y. Yeung, Extracting Networks of People and Places from Literary Texts, in: Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation, Faculty of Computer Science, Universitas Indonesia, Bali, Indonesia, 2012, pp. 209–218. URL: https://aclanthology.org/Y12-1022.
[16] Y. Rochat, F. Kaplan, Analyse des réseaux de personnages dans Les Confessions de Jean-Jacques Rousseau, Les Cahiers du numérique 10 (2014) 109–133. URL: https://www.cairn.info/revue-les-cahiers-du-numerique-2014-3-page-109.htm. doi:10.3166/LCN.10.3.109-133.
[17] A. Grener, M. Luczak-Roesch, E. Fenton, T. Goldfinch, Towards A Computational Literary Science: A Computational Approach To Dickens' Dynamic Character Networks (2017). doi:10.5281/ZENODO.259499.
[18] G. A. Sack, Character networks for narrative generation: Structural balance theory and the emergence of proto-narratives, Complexity and the human experience: Modeling complexity in the humanities and social sciences (2014) 81–104.
[19] V. Krishnan, J. Eisenstein, "You're Mr. Lebowski, I'm the Dude": Inducing Address Term Formality in Signed Social Networks, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2015. doi:10.3115/v1/n15-1185.
[20] F. Incitti, F. Urli, L. Snidaro, Beyond word embeddings: A survey, Information Fusion 89 (2023) 418–436. doi:10.1016/j.inffus.2022.08.024.
[21] L. Lebart, B. Pincemin, C. Poudat, Analyse des données textuelles, number 11 in Mesure et évaluation, Presses de l'Université du Québec, Québec, 2019.
[22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 [cs] (2013). URL: http://arxiv.org/abs/1301.3781.
[23] J. Pennington, R. Socher, C. Manning, GloVe: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. URL: http://aclweb.org/anthology/D14-1162. doi:10.3115/v1/D14-1162.
[24] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146. URL: https://direct.mit.edu/tacl/article/43387. doi:10.1162/tacl_a_00051.
[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[26] Y. Li, T. Yang, Word Embedding for Understanding Natural Language: A Survey, in: Studies in Big Data, Springer International Publishing, 2017, pp. 83–104. doi:10.1007/978-3-319-53817-4_4.
[27] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings, in: International Conference on Learning Representations, 2017.
[28] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, PMLR, 2014, pp. 1188–1196.
[29] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: International Conference on Machine Learning, PMLR, 2015, pp. 957–966.
[30] A. Akbik, D. Blythe, R. Vollgraf, Contextual String Embeddings for Sequence Labeling, in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
[31] P. Massey, P. Xia, D. Bamman, N. A. Smith, Annotating Character Relationships in Literary Texts (2015). arXiv:1512.00728.