
Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives

Jean-Baptiste Camps

Nicolas Baumard

Pierre-Carl Langlais

Olivier Morin

Thibault Clérice

Jade Norindr

ALMAnaCH - Inria, 2 Rue Simone IFF, 75012 Paris; École nationale des chartes - Université PSL, 65 rue de Richelieu, Paris, 75012, France; OpSci, 3 rue de Milan, Paris, 75009, France


In this paper, we test a famous conjecture in literary history put forward by Seignobos and de Rougemont, according to which the French central medieval period (12th-13th centuries) is characterized by an important increase in the cultural importance of love. To do that, we focus on the large and culturally important body of manuscripts containing medieval French long narrative fictions, in particular epics (chansons de geste, of the Matter of France) and romances (chiefly romans on the Matters of Britain and of Rome), both in verse and in prose, from the 12th to the 15th century. We introduce the largest available corpus of these texts, the Corpus of Medieval French Epics and Romances, composed of digitised manuscripts drawn from Gallica, and processed through layout analysis and handwritten text recognition. We then use semantic representations based on embeddings to monitor the place given to love and violence in this corpus through time. We observe that themes (such as the relation between love and death) and emblematic works well identified by literary history do indeed play a central part in the representation of love in the corpus, but our modelling also points to the characteristic nature of more overlooked works. Variation in time seems to show that there is indeed a phase of expansion of love in these fictions, in the 13th and early 14th century, followed by a period of contraction that seems to correlate with the Crisis of the Late Middle Ages.

Keywords: Medieval French Literature, Cultural Evolution, History of Emotions, Document Analysis and Recognition, HTR, Word and Document Embedding

1. Introduction: love and war in medieval French narratives

1.1. Love, a ‘medieval invention’…

Love – or more precisely love in literature – is sometimes depicted as “a medieval invention”, or rather, to quote the exact formulation of this phrase, which goes back to the historian Charles Seignobos, “Love dates from the 12th century” [35]¹. Seignobos did not mean that Antiquity had no conception of love; rather, he differentiates the antique notion of Eros, interpreted as sexual desire, at least for males (he admits the idea of love-as-respectful-devotion in Antique women), from the modern (and in his mind, Western only) notion of reciprocal love, which he defines as “a new feeling of respect and reciprocal admiration, supposing equality between the two sexes” [35]. This conception would find its origins in 12th-century “courtly love”. The cultural movement of courtly love, the fin’amor, started in Southern France in the lyrical poetry of the troubadours around the start of the 12th century, as far as we can judge, and underwent a spectacular expansion, spanning Western Europe and fertilising other types of literary production, such as the new form of the romans (romance), and even epic narrative forms, until then more preoccupied with violence, lineage and feudal values, such as the chansons de geste, eventually blurring the frontier between the two genres.

In his masterwork, L’Amour et l’Occident (Love and the Western World) [33], Denis de Rougemont makes of the myth of Tristan et Iseut the archetype and the most emblematic early representative of love-as-passion in Western culture, of which some significant features are the link between love and death, the unsatisfied, frustrated or fatal issue of a yet reciprocal sentiment, the transgression of moral norms or social duties and, in fine, the adulterous nature of the relationship.

In addition, de Rougemont also draws a correspondence between two apparently antagonistic themes: love and war. Noting the use of military vocabulary in the depiction of the conquest of the loved lady, to whom the lover must lay siege after he has been struck by the arrows of Love, he argues that both (courtly) love and (gallant) war are realised in the same chivalric ideals (“La chevalerie, loi de l’amour et de la guerre”: chivalry, law of love and war).

De Rougemont’s work has received some criticism, because there is no unique definition of love in the Middle Ages: the refined love (fin’amor) of the lyrical poet differs substantially from the passionate ‘crazy’ love of Tristan and Iseut (fol’amor) [12], and no medieval story fits all aspects of courtly love as a deliberate choice (hence, adulterous in nature) and a reciprocal sentiment. The troubadours most often complain of the disdain or excessive pride of the lady they love (hence, not always reciprocal), while Tristan and Iseut’s magical-philtre-induced love does not perfectly fit the idea of a deliberate choice. Despite these reservations, and as far as the surviving documentation allows us to see, the 12th and 13th centuries saw an explosion of fictional love stories, both by known writers such as Beroul, Chrétien de Troyes or Marie de France, and in the many anonymous works of this period, such as Floris and Blancheflore or Aucassin and Nicolette [2]. If these were written first in Occitan and French – be it Continental French, Anglo-French, or Franco-Italian –, equivalents in other Western European languages were soon to appear, perhaps with the exception of Spain, where the first examples of symmetric and passionate narrative love stories arrive later, in the 14th century [2]. For instance, in addition to the French versions of the Tristan and Iseut story by Beroul, Thomas of Britain, Marie de France and Chrétien de Troyes, we also see German (Gottfried von Strassburg), Italian (Tristano Riccardiano, Tristano Veneto and Tristano Corsiniano), English (Sir Tristrem) and Czech (Tristam and Izalda) versions, and slightly later Spanish ones, from the late 14th c. to the 16th [2]. As Morris notes: “As far as our surviving evidence takes us, there was an enormous explosion of interest in the subject shortly before 1100. An almost complete silence was followed by the beginning of love literature which challenged in quality and surpassed in volume that of any earlier civilization” [29, 2].

¹ The more catchy “Love is a modern invention”, Seignobos explains, was a corrupted version of what he said to a lady, who told the journalist Gustave Téry, who told a colleague, Henri Bellamy, who in turn published it in the Quotidien, prompting Seignobos’s reply.

The 14th and 15th centuries constitute a period somewhat less explored with respect to Medieval French literature, despite – or probably because of – a large body of prose works, very often of considerable length, and sometimes with abundant surviving manuscript traditions. Many of these works were new versions of previous texts, such as new versions of Tristan and Iseut or Floris and Blancheflore, and they are accompanied by many seemingly new creations such as Ponthus et Sidoine or Cleriadus et Meliadice, while, in other European languages, the works of Chaucer or the Middle Dutch plays Esmoreit or Gloriant feature an important dimension of reciprocal love as passion [2]. Yet, to some extent, it remains to be seen whether, in Medieval narratives, the importance of the theme of love and passion actually increases or decreases in the literature of the Late Middle Ages.

1.2. … in broader perspective

If the importance of the development of love in medieval Western culture from the 12th century onwards cannot be denied, research now tends to put it back in a context where similar increases happened elsewhere in Eurasia, for instance in the Arab world, India, Persia, China and Japan, as well as in the West in other periods of time, such as the Greece of the first to third centuries AD (that saw the production of ‘novels’ such as Leucippe and Clitophon) [3].

The medievalist Georges Duby was perhaps the first to hypothesize that economic development might be the main driver in the rise of love in Western Europe [14]. Recently, Baumard, Huillery, Hyafil, and Safra [3] argued that a ‘higher level of economic development’ (approached through measures such as GDP per capita, urbanisation rate, size of the largest city, …) ‘is strongly associated with a greater incidence of love in narrative fiction’, in the Eurasian space, both in Antiquity and during the Middle Ages and Early Modern period [3].

In line with this work, we thus test whether there are indeed shared trends between love and economic development in literary fictions, on the specific corpus that played a central cultural role in medieval Europe and spurred de Rougemont’s analysis: medieval French long narrative fictions. We compare it to measures of economic development, based on available data for GDP in medieval France [32, 5]. It is to be noted that, if both appear correlated, it will not necessarily mean that an increase in GDP causes an increase in the taste for love in fiction, as both could be influenced by external factors that are not yet included in our analysis, such as, for instance, political stability, or the absence of major shocks (wars, pandemics, …).

To go beyond the received (and relatively cramped) literary canon, in terms of works, authors, genres and periods, and to proceed on the material basis of the reception and popularity of the works, we build a large corpus of manuscripts of chansons de geste as well as verse and prose romans, from the 12th to the 15th century. By working on the basis of the surviving manuscripts, we hope to circumvent some biases due to both the romantic and the scholarly reception of medieval works in the 19th-21st centuries, such as the overvaluation of works fitting contemporary aesthetic criteria (rather than works popular during the Middle Ages), but we are subject to biases in the differential preservation and destruction of these works [22, 8].

By including both epics (chansons de geste) and romans, we will also be able to go beyond traditional genre definitions to monitor the importance of the theme of love, and its ability to cross generic boundaries.

Finally, to give context to the evolution of the theme of love, and to test the hypothesis of de Rougemont of a link between them, we will also follow the chivalric theme of violence and war.

2. Materials and Methods

2.1. Global design and justification

Our global experimental design is as follows:
1. gather a corpus as large as possible of manuscripts of medieval French epics and romances, through the harvesting of digitised manuscripts and their subsequent processing through a dedicated workflow using computer vision and natural language processing;
2. build a semantic representation of words and documents, based on a joint embedding, using doc2vec, and estimate its quality using literary knowledge;
3. compute scores for documents, based on cosine similarity in the joint embedding between them and the vectors of words for love, and for violence;
4. monitor their variation during the period;
5. compare the variation with historical knowledge on economic development: do they converge, diverge or seem unrelated?
6. (appendix) use top2vec, based on the doc2vec embedding, to look at the topics with high love or violence scores, to check that they are indeed related to courtly love, and to violence, by interpreting them in light of literary history.

As there is no unequivocal definition of what constitutes the theme of love or of violence, we choose to constitute them by concatenating the vectorised word representations of:
• the lexemes aimer and amour, and their inflectional and spelling variants that we could identify²;
• the lexeme ferir (frapper, hit) and its inflectional and spelling variants that we could identify. ferir was chosen because it is probably the most ubiquitous verb in descriptions of fights [19, p. 149].

² We do not specifically target physical love and sexuality (and important texts in that regard, such as the Fabliaux, are indeed not included in our corpus), though these themes might still have appeared, as well as forms of non-romantic love (e.g., divine, familial, etc.). Yet, the results below show that our vector captures almost exclusively courtly love themes.

2.2. Corpus

2.2.1. Scope

The Corpus of Medieval French Epics and Romances introduced in this paper is, to our knowledge, the largest corpus of Medieval French created until now. Though still in an early version and with only partial coverage of its final scope, it is as of now comprised of 265 manuscripts, and 410 text witnesses, for a total of 38.5 million word tokens. The deep learning based workflow for text acquisition from the digitised manuscript images, as well as the subsequent ground-truth free quality evaluation of the results, are depicted in Appendix A.

The goal of the corpus is to encompass every manuscript of medieval French long narrative works that fall broadly in the category of chansons de geste (epics) and chivalric romans (romances), chiefly but not exclusively from the Matters of Rome or Britain, in verse, along with their mises en prose and native prose versions, as long as they are available in digitised form.

For this paper, the scope was limited to digitisations available as part of the Gallica digital library of the Bibliothèque nationale de France. In the near future, it will be expanded to include other sources (such as the Bibliothèque Virtuelle des Manuscrits Médiévaux)³.

The inventory of works, texts and manuscripts (still ongoing) was made by collating a list of epics compiled by one of the authors, data from the OpenStemmata repository [7], and the list published by Kestemont et al. [22]. Verifications were made by going back to the digital catalog of the BnF [4], and to the online databases Jonas and Arlima [20, 6]. In particular, data was enriched with links to available digitisations.

2.2.2. Corpus facts and figures

Main statistics about the corpus are presented in Table 1. Due to the unequal availability of digitised manuscripts (and of the underlying sources), as well as selection on the basis of handwritten text recognition (HTR) quality, the corpus is not chronologically balanced, and no regularisation was performed on this aspect (which in part supposedly also reflects variations in manuscript production and preservation).

³ The version used for this paper is limited to a subset of 258 manuscripts, due to time and computing resources constraints. It will be expanded in the course of the following months, to encompass all digitised manuscripts from our list of 800 manuscripts containing relevant texts and kept in the Bibliothèque nationale de France. In the context of the preparation of this paper, we have focused on increasing the diversity of texts rather than including, say, all the very numerous manuscripts of the prose Arthurian Vulgate cycle or the Guiron le courtois.

The chronological distribution of the cmfer-select corpus (Figure 1, left) shows that substantial data is available for all the period envisioned. It also shows an almost continuous increase during the 13th century, followed by a very important decrease, hitting a lowest point in the years following the Black Death pandemic (associated with a significant population drop). It then increases again in the 15th century, with another notable drop during one of the worst periods of the Hundred Years’ War, roughly the Armagnac–Burgundian Civil War (1407-1435). Even though biases in the availability of sources and choices made for the corpus are likely to be present, the correspondence with important historic events might be an argument for a form of representativeness of the corpus with respect to the medieval production, or perhaps with the inheritance of the Royal Library, established by Charles V the Wise in 1367 (the ancestor of the BnF, from which digitised manuscripts were obtained).

It is to be noted that the unequal distribution of tokens in time is not necessarily in itself a problem to estimate the average importance of love, as long as enough material is present throughout the period.

The distribution of the number of tokens by work (Figure 1, right) is, as expected, very unequal: works are most often found in a single witness, but a handful of texts have an abundant tradition, reflecting their enduring success. Some of the latter are very long or cyclical works and can ultimately account for up to two orders of magnitude more tokens than most works: this is in particular the case of the prose Tristan (23 witnesses and 10M tokens, more than a fourth of the corpus), Guiron le courtois (9 witnesses and 4M tokens), L’Estoire del Saint Graal (20 witnesses and 3M tokens), or Garin le Loherain (11 witnesses and 1M tokens). We took the decision not to restrain the number of witnesses for a given text, because the aim of our analysis is to reflect the reception of the texts. In a context where books are expensive objects, commissioning the copy of a voluminous work is a significant choice.

2.3. Semantic representation of the words and documents

2.3.1. Model training

Given the level of lexical, spelling and abbreviative variation in the corpus, as well as the noise induced by the HTR process, and the current absence of subsequent normalisation such as lemmatisation, we are faced with a large number of variant forms. To deal with this, we choose a method that is supposed to increase robustness to this type of variation, by creating a shared embedding of words, using word2vec [27], and documents, with doc2vec [25]. In addition, this allows us to use top2vec to extract topic vectors, to investigate the contexts in which our queried word vectors are used, to ascertain that they do, in fact, represent occurrences of courtly love or violence (see Appendix D).

Given the nature of our corpus, we are chiefly interested in several of the main claimed features of these embeddings, in particular the fact that they supposedly do not need stemming or lemmatisation, nor lists of stop words. In addition, some benchmarks have also found doc2vec to be the most efficient model over encoders, such as the Universal Sentence Encoder or BERT Sentence Transformer [21], when used in contexts such as topic discovery. Last but not least, since our main goal is to interrogate the documents based on the importance of the semantic content related to the forms of the lexemes aimer/amour and ferir, the advantage of using a combination of word2vec, doc2vec and top2vec is that it allows us to manipulate and interrogate shared representations of word, document and topic vectors.

Given the large size of the texts, they were sampled in 15-line fragments (resulting in 334 060 fragments). The doc2vec model was trained with mostly default hyperparameters, with additional adjustments based on existing benchmarks and the specifics of our corpus (Table 2). In particular, regarding the number of training epochs, previous studies on Doc2Vec found the optimal number of epochs for a fairly large corpus (in terms of document length and number of documents) to be relatively low: for a 4.5 million words corpus, Lau and Baldwin [24] found the optimal number of epochs to be 20, as opposed to 400 for a 0.5 million words corpus, and the minimum frequency of a word in the corpus for inclusion to be 5 instead of 1. Curiskis et al. [13] showed that for a dataset of approximately 7 000 documents of a mean length of 140 words, the optimal number of training epochs was 50. Since our corpus is closer to 40 million words and 300 000 samples, respectively one and two orders of magnitude larger, we retain the option of training for five epochs, with a minimum count of 25. In addition, we chose to use negative sampling instead of a hierarchical softmax step at the output layer, because it proved both more efficient and yielded better quality vectors in existing benchmarks [28, 24], and chose a vocabulary based only on word 1-grams.
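The sampling and training choices described above can be sketched as follows. This is a minimal illustration, not the authors' script: the helper names are hypothetical, the handling of a final fragment shorter than 15 lines is an assumption, and the number of negative samples is simply the gensim default, since the full hyperparameter table is not reproduced here. Only the five epochs, the minimum count of 25, the use of negative sampling and the 8 workers come from the text.

```python
from typing import Iterable, List

# Values stated in the text: 15-line fragments, 5 epochs, minimum count of 25,
# negative sampling rather than hierarchical softmax, 8 parallel workers.
FRAGMENT_LINES = 15
DOC2VEC_PARAMS = {
    "epochs": 5,
    "min_count": 25,
    "hs": 0,        # 0 disables hierarchical softmax in gensim
    "negative": 5,  # assumption: gensim's default number of negative samples
    "workers": 8,
}


def split_into_fragments(lines: List[str], size: int = FRAGMENT_LINES) -> List[List[str]]:
    """Cut a witness (a list of text lines) into consecutive fragments of `size` lines.

    Assumption: a shorter trailing fragment is kept rather than dropped.
    """
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def train_doc2vec(fragments: Iterable[List[str]]):
    """Train a joint word/document embedding on whitespace-tokenised fragments."""
    # Imported lazily so that the fragmenting helper works without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [
        TaggedDocument(words=" ".join(fragment).split(), tags=[index])
        for index, fragment in enumerate(fragments)
    ]
    return Doc2Vec(documents, **DOC2VEC_PARAMS)
```

With a minimum count of 25, forms that occur only a handful of times (including many one-off HTR errors) are excluded from the vocabulary, which is one way the setup limits the impact of noise.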

Training was performed on a dedicated server, using 8 parallel workers.

2.3.2. Interrogation of the resulting vectors

Topics, texts and passages were then interrogated using the following methodology:
1. word vectors were interrogated on the basis of the lemmas ‘amer’ (to love) and ‘amor’ (subst. love), on the one hand, and ‘ferir’ (to hit) on the other, to retrieve the most similar words. Other forms (inflectional, spelling, segmentation variants, or variant forms due to HTR noise) of the lemmas were then identified, and added to the request (e.g., ‘amour’, ‘lamor’, ‘lamour’, ‘amors’, ‘amours’, ‘samor’, ‘damors’, ‘amo’, ‘amoit’, ‘lamoit’, ‘amee’, etc.), iteratively, until the most similar words stopped yielding forms of the lemmas (Appendix C);
2. those sets of words and their corresponding vectors were used to examine their direct environment, in terms of word vectors closest to them, as well as in terms of document vectors closest to them (both in cosine similarity), in order to establish the semantic contents and the nature of the works that they would retrieve, and to verify that they were concerned with courtly love and chivalric violence. In addition to this verification of the quality of the embedding, topic modelling was used more secondarily to look directly at the closest associated themes (Appendix D);
3. finally, those sets of word vectors were used to compute a love and a violence score (based on cosine similarity) for each document, and to monitor the variation of this score through time. For this, the love and violence scores of all passages were retrieved, in order to calculate a yearly mean. This required distributing the passages chronologically based on their date or approximate dating: for instance, a passage in a manuscript dated to 1245 was assigned to the year 1245 with a weight of one; a passage dated to the last quarter of the 13th century was assigned to the years 1276 to 1300 with a weight of 1/(1300 − 1275) = 0.04 for each year.

The mean scores were then computed and plotted as a time series, using local regression with the LOESS method (locally estimated scatterplot smoothing), with a smoothing coefficient of 0.15.
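The chronological weighting in the last step can be illustrated with a short sketch. It assumes each passage is reduced to a hypothetical `(first_year, last_year, score)` triple and that the yearly value is a weighted mean; the function name and data layout are illustrative, and the LOESS smoothing is not reproduced.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical layout: (first_year, last_year, score). An exactly dated passage
# has first_year == last_year and therefore a weight of one.
Passage = Tuple[int, int, float]


def yearly_weighted_means(passages: Iterable[Passage]) -> Dict[int, float]:
    """Spread each passage uniformly over its date range and average per year."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for first_year, last_year, score in passages:
        n_years = last_year - first_year + 1
        weight = 1.0 / n_years  # e.g. 1276-1300 -> 25 years -> 0.04 per year
        for year in range(first_year, last_year + 1):
            weighted_sum[year] += weight * score
            total_weight[year] += weight
    return {
        year: weighted_sum[year] / total_weight[year]
        for year in sorted(total_weight)
    }


example = [
    (1245, 1245, 0.2),  # manuscript dated to 1245
    (1276, 1300, 0.6),  # last quarter of the 13th century
    (1300, 1300, 0.1),  # manuscript dated to 1300
]
means = yearly_weighted_means(example)
```

In this toy example the year 1300 receives a full-weight passage and a 0.04-weight passage, so its mean stays close to the score of the precisely dated one; loosely dated manuscripts inform many years but dominate none.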

3. Results

3.1. Semantic environment of love and violence

In order to inspect the validity of the vectors of love and violence, and to establish to what specific kind of love or violence they referred, we looked at the contexts, through the interrogation of the most similar word vectors, based on cosine similarity between our love and violence vectors (mean of the word vectors that compose them) and the vectors for each word in the model (Table 3). We completed it through topic modelling (Appendix D).
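The query described above, a theme vector taken as the mean of its word vectors and compared to every other word by cosine similarity, can be sketched in plain Python. The function names and the two-dimensional toy vectors are purely illustrative assumptions; in practice the vectors would come from the trained embedding.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_words(
    theme_forms: List[str], word_vectors: Dict[str, Vector], top_n: int = 10
) -> List[Tuple[str, float]]:
    """Rank words outside the query set by similarity to the theme vector."""
    theme = mean_vector([word_vectors[form] for form in theme_forms])
    candidates = [
        (word, cosine_similarity(theme, vector))
        for word, vector in word_vectors.items()
        if word not in theme_forms
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "amor": [1.0, 0.0],
    "amour": [0.9, 0.1],
    "cuer": [0.8, 0.2],
    "espee": [0.0, 1.0],
}
ranked = most_similar_words(["amor", "amour"], toy_vectors, top_n=2)
```

The same similarity function applies unchanged to document vectors, since words and documents share one embedding space.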

The words closest to the love vector exhibit a catalogue of courtly love vocabulary: in the designation of the lovers, in the expectations, languishing troubles and (metaphoric or not) death from love (as well as potential love quarrels); the use of feudal vocabulary (loyalty, feudal possession), the expression of feelings and its traditional metaphoric elements (fire, heart, …), as well as love promises, desire and kisses. We also find courtly qualities (beauty, goodness, high social extraction), and their traditional incarnated opposites, be they jealous or simply not possessed of courtly qualities (the villein and their supposed vileness or boorishness).

The words closest to the ferir (hit) vector form an even more compact vocabulary: it is about hitting one’s opponent with offensive weapons in the teeth, chest or shield, breaking pieces of armour, slashing, cleaving, slicing, piercing through, throwing him off his horse, and ultimately killing him.

Given these results, we are satisfied that the word embedding offers a relevant representation of (courtly) love and violence (especially, chivalric combats). We then move on to examine the document embedding, on the basis of these word vectors.

3.2. Document scores for love and violence

Document-level scores for love and violence were computed for each textual witness, by taking the mean score of all passages extracted from them. If we rank them accordingly (Table 4), we notice in both lists the importance of manuscripts dating to the 13th and the early 14th century. The list of witnesses closest to the love vector shows the importance of courtly love stories, in a mix of works whose literary importance is often known and sometimes less so: the very famous Lai de l’ombre, for instance, is an archetypal courtly tale by Jean Renart, in which a knight seduces a woman who was refusing him, by gifting a ring to her reflection in a fountain. The list also contains several adventure and love romances centred on a couple, such as Amadas et Ydoine, Floire et Blancheflor (here in its ‘aristocratic’ version), Cristal et Clarie. Several of these works share an Ovidian inspiration, and narrative patterns typical of courtly love (such as the gift or exchange of rings). The works of Adenet le Roi also feature prominently, be it the courtly adventure romance of Cléomadès, or the Berte aus grans piés, in which he mixes epic sources with a fine description of the feelings and troubles of its chief female character. Some lesser known texts fit quite well in this list: the highest scoring one is the Roman de la poire, a text in-between romance and lyrical poetry, in which “the themes of courtesy are present, with sophistication and refinement pushed to the extreme” [34, our translation], and which centers around the initially non-reciprocated love between the narrator and a lady, communicating through lyrical poems, and benefiting from the mediation of allegories of Love, courtly virtues (Loyalty, Subtle Thought, Gentle Gaze…) and characters borrowed from famous texts (Tristan and Iseut, Pyrame and Thisbé).

The 14th century Dame à la licorne et beau chevalier au lion is a comparatively late example of a romance mixed with lyrical poems, in a manuscript that was likely gifted to the princess Blanche de Navarre at the time of her wedding with the king of France. It is also an archetypal courtly story, in which a married young lady, accompanied by a unicorn, falls in love with a knight accompanied by a lion, and lives a story filled with tropes such as the rumors of the death of her lover, the slanders of the jealous against the couple, etc.

The presence of the Song of Saint Alexis is seemingly a discrepancy in comparison to the rest of the list, yet it might be explained by the centrality in the tale of the marriage of Alexis, from which he flees, up to the end of the narrative, which finishes with the lamentations of his wife before his dead body (creating a discordant echo to the courtly ‘death from love’).

On the other hand, the witnesses closest to the violence vector are chiefly epics (chansons de geste). For instance, Aliscans and the Chevalerie Vivien, which appear several times in the list, are centred around the eponymous battle in Aliscans between the Saracen king Deramé of Cordoba and the Frank knight Vivien, who swore never to back down before the pagans, and endures a heroic death precisely because of his vow. It is interesting to notice the presence in this list, among more fictional texts, of the Conquest of Jerusalem, which draws on the events of the First Crusade (in particular, the siege of Jerusalem in 1099).

The nature of the documents closest to the love and violence vectors, when compared with existing literary knowledge, confirms the quality of the document embedding, and the ability of our method to recognise the importance of courtly love or chivalric violence contents in the texts.

3.3. Median scores variation in time

Examining the variation through time of the semantic contents of the documents, year by year, by plotting the average yearly document similarity of the samples with the vectors of the love and violence sets of keywords, seems to show a strong increase of the presence of love until, roughly, the years 1330-1340, followed by a downward trend until the end of the Middle Ages, roughly coinciding with the Crisis of the Late Middle Ages (or the Medieval Great Depression), though not completely. A comparison with reconstructed economic data [5] shows that the important crises of the beginning of the 14th century, in particular the Great Famine of 1315–1317, which coincides with a very large drop in estimated GDP per capita, do not seem to affect the importance of love in fiction, though they might have contributed to the very significant drop in available manuscripts observed above (Figure 1). If there seem to be shared trends, up to a point, in the long term variation of economic development and love score (increase in the 13th century, decrease during the Crisis of the Late Middle Ages), the two curves do not match perfectly, and the comparison perhaps hints at a time lag of a couple of decades in the latter. This might hint at a form of cultural inertia, especially in a context where textual transmission (the copy and circulation of texts) is a lengthy process, and by no means as fluid as in later periods. It is to be noted that some of the increases of the GDP per capita are not necessarily due to an increase of GDP, but instead to sudden decreases in population, such as the one caused by the Black Death in 1347-1351.

Violence, on the other hand, seems to start its decrease earlier, around the middle of the thirteenth century. This could coincide with the slow loss of favour of the genre of epics (chansons de geste), victim of the competition of the more recent genre of romans, as well as the irruption in late chansons de geste of themes other than war: for instance individual adventure, love or wonder.

4. Discussion and future work

In building the corpus used for this study, we remain subject to the biases of the unequal preservation of documents through time, and of large and small scale historical events, from the Great Plague to the ups and downs of the Royal Library, whose collections are the ancestor of those of the BnF that we used (cf. Figure 1). In addition, since manuscripts (especially those preserved) were expensive objects reserved to a certain elite, their contents cannot be claimed to represent the taste of society as a whole, but rather that of a relatively wealthy and educated class (aristocratic or otherwise).

Yet, within the limits of these sources, we observe that we are able to build and query a semantic representation of the words and documents that exhibits many of the tropes of this literature that researchers have studied through close reading. In particular, the semantic environment of the love word vector, both in terms of close words and of close documents, corroborates and sometimes enriches literary knowledge on the tropes of courtly love and the associated works. They align with several of de Rougemont’s ideas about the importance of the lyric tradition, as well as the strong link between the themes of love, love-induced suffering and death.

The variation in time of the importance of love and violence shows initially opposite trends that, after c. 1340, seem to align more closely. In terms of literary history, this could correspond to the traditional epic chansons de geste, focused on collective war against Saracens and on feudal conflicts, slowly going out of fashion, and progressively aligning their content with the more modern genre of the roman, including individual adventures and love stories. Once epics have merged with chivalric romances, both seem to behave in similar ways through time.

Finally, the variation in time of the mean importance of love in the fictions seems to show a phase of expansion until the early 14th century, after which it follows a downward trend during the period of the Crisis of the Late Middle Ages (with a time lag of roughly 20 years, the decline in love starting around 1330-1340, while the Great Famine of 1315-1322 traditionally marks the beginning of the Crisis). Further research is needed to explore this issue in greater depth, and to test the correlation with economic development as well as other factors.

Data and materials availability

Data and scripts used for topic modelling are available on a Zenodo repository: 10.5281/zenodo.10011791. The CMFER is also available on GitHub, https://github.com/Jean-BaptisteCamps/CMFER.

A. Acquisition workflow and evaluation of the corpus

A.1. Workflow

The workflow for text acquisition is depicted in fig. 3.

Manuscript images are harvested using the International Image Interoperability Framework (IIIF), based on their manifest, then processed through layout analysis, using the YALTAi [10] object detection approach and the Gallicorpora model [31], with the SegmOnto ontology for the semantic typing of zones [16], in combination with a Kraken [23] model for the identification of lines. The resulting pairs of ALTO (Analyzed Layout and Text Object) files and page images are then passed to handwritten text recognition, using the deep learning approach of the Kraken software and the CREMMA Medieval Generic model [11]. This model produces a version of the text that encodes abbreviations as such, and follows the graphematic conventions recently elaborated at the École des chartes, in a seminar led by A. Pinche, J.-B. Camps and F. Duval [30].
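As an illustration of the harvesting step, the sketch below lists the image URLs found in a IIIF manifest that has already been loaded as a Python dictionary. It assumes the layout of version 2 of the IIIF Presentation API (`sequences`, `canvases`, `images`, `resource`); the function name `image_urls` and the toy manifest are hypothetical, and the scripts actually used for the study may proceed differently.

```python
def image_urls(manifest):
    """Return the image URLs of a IIIF Presentation 2.x manifest, in page order."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])
    return urls


# Toy manifest with hypothetical identifiers (not a real Gallica document).
toy_manifest = {
    "sequences": [{
        "canvases": [
            {"images": [{"resource": {"@id": "https://example.org/iiif/f1/full/full/0/native.jpg"}}]},
            {"images": [{"resource": {"@id": "https://example.org/iiif/f2/full/full/0/native.jpg"}}]},
        ]
    }]
}
pages = image_urls(toy_manifest)
```

In practice the manifest would first be downloaded and parsed as JSON, and each URL then fetched and stored next to its layout analysis output.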

The resulting ALTO files (one per page) are then processed through a dedicated script, to create a single raw text file per witness (i.e., an instance of a given work in a given manuscript), with the relevant metadata in an accompanying tsv format.
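The conversion from per-page ALTO files to one raw text per witness can be sketched as follows. This is a minimal illustration and not the dedicated script itself: it assumes that each transcribed line is a `TextLine` element whose `String` children carry the text in a `CONTENT` attribute, and the function names and the toy page are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, so that any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(alto_xml):
    """Return the transcribed lines of one ALTO page as a list of strings."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


def witness_text(pages):
    """Concatenate the pages of one witness, one transcribed line per text line."""
    return "\n".join(line for page in pages for line in alto_to_lines(page))


# Hypothetical one-line page, for illustration only.
toy_page = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Que ie muir"/><String CONTENT="par tres bien amer"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
text = witness_text([toy_page, toy_page])
```

Matching on local element names rather than on a fixed namespace is a design choice made here so that the sketch does not depend on a particular ALTO version.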

A.2. Quality evaluation

We follow the approach recently described by Clérice [9], for ground-truth free evaluation of handwritten text recognition (HTR) of Old French. This approach is based on natural language processing, and aims to evaluate the apparent linguistic consistency of a text, rather than its match with the original line image. It treats the evaluation as a classification task, where a model is trained to classify transcribed lines into categories that are supposed to approximate a level of character error rate: Good ([0, 10)%), Acceptable ([10, 25)%), Bad ([25, 50)%), and Very Bad (≥ 50%). For this, it uses a model based on an embedding, sentence encoder and linear classifier structure. It produces as an output a classification of each line into one of the aforementioned categories (fig. 4). We reuse the model provided by Clérice with the original paper.
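To make the interval conventions of the four categories explicit, the following sketch maps a character error rate (in percent) to its category. It only illustrates the bins: the classifier described above predicts the category directly from the transcribed line, without access to a true error rate, and the function name `cer_category` is a hypothetical one.

```python
def cer_category(cer):
    """Map a character error rate in percent to one of the four quality classes."""
    if cer < 0:
        raise ValueError("a character error rate cannot be negative")
    if cer < 10:
        return "Good"
    if cer < 25:
        return "Acceptable"
    if cer < 50:
        return "Bad"
    return "Very Bad"


examples = [cer_category(value) for value in (0, 9.9, 10, 25, 49.9, 50, 120)]
```

Lower bounds are inclusive and upper bounds exclusive, so a rate of exactly 10% falls in Acceptable and a rate of exactly 50% in Very Bad.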

To provide an estimate for each textual witness, we count the total number of lines in each category, and compute a ratio, both for each category and for good and acceptable vs bad and very bad (fig. 5). The median ratio of good lines is 65% and the median ratio of good+acceptable lines is 94% (min: 9%; 1st quartile: 89%; 3rd quartile: 97%; max: 100%). Typical examples of results for maximum, median, first quartile and minimum values of good+acceptable lines are given in appendix (B).
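The per-witness estimate described above amounts to counting labels and dividing by the number of lines. A minimal sketch, in which the function name, the label strings and the toy witness are assumptions made for illustration:

```python
from collections import Counter

CATEGORIES = ("Good", "Acceptable", "Bad", "Very Bad")


def quality_ratios(line_labels):
    """Return the share of lines in each category, plus the good+acceptable share."""
    counts = Counter(line_labels)
    total = len(line_labels)
    if total == 0:
        raise ValueError("a witness needs at least one classified line")
    ratios = {category: counts[category] / total for category in CATEGORIES}
    ratios["Good+Acceptable"] = (counts["Good"] + counts["Acceptable"]) / total
    return ratios


# Hypothetical labels for a ten-line witness.
labels = ["Good"] * 6 + ["Acceptable"] * 3 + ["Bad"]
ratios = quality_ratios(labels)
```

For this toy witness, 6 of 10 lines are good and 9 of 10 are good or acceptable.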

The distribution of quality estimates by century shows comparable levels of quality for the 13th and 14th centuries, with most manuscripts above 80% of good and acceptable lines, and a few outliers below. On the other hand, quality decreases for the 15th century, with a less compact distribution. This can be explained by the significant number of 15th century manuscripts written in cursive scripts, with often less formal execution, that differ significantly from the Gothic Textualis that otherwise dominates the corpus.

Outliers with a large number of bad or very bad lines exist, and they were removed from the corpus before further analysis. The threshold was set at 1.5 times the interquartile range below the 1st quartile (ratio of good+acceptable lines ≥ 0.78), resulting in 370 texts selected for further analysis (Table 1).
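The exclusion rule can be sketched as a lower fence computed from the quartiles. With the rounded quartiles reported above (0.89 and 0.97), the formula gives 0.89 - 1.5 × 0.08 = 0.77, close to the 0.78 used here; the small difference presumably comes from rounding. The quantile method, the function names and the toy values below are assumptions, not the settings of the original scripts.

```python
from statistics import quantiles


def lower_fence(values):
    """Return Q1 - 1.5 * IQR, the usual lower bound for outlier detection."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1 - 1.5 * (q3 - q1)


def keep_witnesses(ratios_by_witness):
    """Keep witnesses whose good+acceptable ratio is at or above the fence."""
    fence = lower_fence(list(ratios_by_witness.values()))
    return {name: ratio for name, ratio in ratios_by_witness.items() if ratio >= fence}


# Hypothetical ratios; witness "w5" is a clear outlier.
toy = {"w1": 0.90, "w2": 0.92, "w3": 0.94, "w4": 0.96, "w5": 0.40}
kept = keep_witnesses(toy)
```

Different quantile conventions give slightly different fences on small samples, which matters little for a corpus of several hundred witnesses.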

B. Example of processing results for the different quality levels

M edeffendes uꝰ dex de cest fu ci deuant ⁊ ausi ꝯtu ses ca cort ai cest cmͦãt b iax sire deffendes cel ch̾ r enfant Qͥ por moi se ꝯbar acel serf mal faisant l ors chiet ariel la dame de peor uait dͤ mblãt ꝯtt ut .i. dꝰ len relieue ꝯ apeloit climãt s ili dist doloe dame neuos esmaies tant U es encor elyas le m̾ chi deu niuant A usi est ore endeu ꝯ al mienchemãt ⁊ la dame se taist qͥ ot le cuer dolant b ien furenit .iii.e. franc a iestor ꝯmẽc̾ T uit cistenuont ensenblep᷑ .pa. domag̾ c eiorineissie dance lances froisier ⁊ noz ienge . srap. sor. sarr̾ . aid'. a destre ⁊ a sencstre acrrãc iesrãs cera a mot parmices hiaumes seur ⁊ chaploier c eschieticecusaires ocirre ⁊ de crãch̃ ͤ Seu luirent encenble .iiii.e. charpẽtier Qi trestuit quipentataẽ ẜ chistei rericẽ Ne seissenta mie teluoise ⁊ teitenpier O u ses non ie uos rans ke pais C ar ie nel puis auser uegarãtir a uns men nai com uns autres cheas C il sunt dolant cont lapaiole out N ra celui qui ne fust abais.

O une plorast del biaux iex de sonuis i apostos les senest enpies leues Cememẽt plore sa sagent apeles S ignor clergie qͥl ꝯseil me douues I lẽ bien orois qͥ del uostre uueces

C. Composition of the love and violence vectors

C.1. Love

All forms of the verb ‘aimer’ (to love) and of the noun ‘amour’ (love) found in the corpus were used. They are the following (forms with HTR errors are marked with an asterisk): aime, aimme, ama, amast, ame, amee, amer, amerai, ameroit, ames, *amo, amoit, amor, amors, amour, amours, *anier, *anne, *camoi, damor, damors, damour, damours, desamour, iaim, iaime, iamoie, laim, laime, lamasse, lamerai, lamoit, lamor, lamour, *laune, maime, mamast, *mor, *mour, *mours, naim, naime, namerai, quamours, samor, samors, samour, *sanie, taime.

C.2. Violence

All forms of the verb ‘ferir’ (to hit) found in the corpus were used. They are the following (forms with HTR errors are marked with an asterisk):

anfiert, efiert, enfiert, feri, ferir, feru, ferus, fiert, leferi, referi, refiert, *uaferir
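As a rough illustration of how such lists of forms can be turned into a single vector and compared with a document, the sketch below averages the word vectors of the attested forms and scores a document vector by cosine similarity. Averaging is an assumption made here for illustration and is not necessarily how the vectors of the study were composed; the three-dimensional embeddings are toy values.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors, component by component."""
    size = len(vectors)
    return [sum(components) / size for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings; real models use far more dimensions.
toy_embeddings = {
    "amor": [1.0, 0.0, 0.0],
    "amour": [0.8, 0.2, 0.0],
    "ferir": [0.0, 0.0, 1.0],
}
love_forms = ["amor", "amour", "amee"]  # "amee" is absent from the toy model
love_vector = mean_vector([toy_embeddings[f] for f in love_forms if f in toy_embeddings])
toy_document = [0.9, 0.1, 0.1]
love_score = cosine_similarity(love_vector, toy_document)
violence_score = cosine_similarity(toy_embeddings["ferir"], toy_document)
```

Forms missing from the vocabulary are simply skipped, so rare or misrecognised spellings do not prevent the vector from being built.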

D. Topic modelling

D.1. top2vec model

In addition to word and document embeddings, we investigated the texts using top2vec [1]. Recent studies have shown top2vec to yield qualitatively better results, and more coherent and human-readable topics than other topic modelling methods, such as the classic LDA [26, 15, 21]. It has already been used in large scale topic modelling of literary corpora [36].

In addition, top2vec automatically finds the relevant number of topics, which facilitates the handling of this large corpus by relieving us of long and computationally intensive benchmarks over arbitrary numbers of topics.

top2vec was trained reusing the doc2vec model described in the main text, with otherwise default hyperparameters. The parameter for the DBSCAN clustering of topics was set to 0.1 in cosine distance (i.e., topic vectors with a smaller cosine distance will be merged).
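The effect of this threshold can be illustrated with a simplified stand-in for the clustering step: the sketch below groups topic vectors that are chained together by cosine distances below 0.1, which is roughly what DBSCAN does when every point is allowed to start a cluster. It is not the top2vec implementation, whose details may differ; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def merge_close_topics(topic_vectors, threshold=0.1):
    """Group topic indices whose vectors are chained by distances below threshold."""
    groups = []
    unvisited = set(range(len(topic_vectors)))
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            close = [i for i in unvisited
                     if cosine_distance(topic_vectors[current], topic_vectors[i]) < threshold]
            for i in close:
                unvisited.remove(i)
            group.extend(close)
            frontier.extend(close)
        groups.append(sorted(group))
    return groups


# Hypothetical topic vectors: the first two are near-duplicates.
toy_topics = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
merged = merge_close_topics(toy_topics)
```

A larger threshold merges more topics and therefore yields fewer, broader ones.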

In the absence of benchmarks dedicated to variation-rich historical corpora similar to ours, we still conducted some experiments on the variation of these parameters (e.g., using a longer n-gram vocabulary, adjusting the clustering parameter, using the top2vec ‘fast-learn’ and ‘deep-learn’ alternatives, etc.), but not in a systematic fashion, due to the long training time for each model (up to 10 hours). These experiments resulted in apparently lower quality topics, with either an excessively small number of topics (e.g., 5 topics) or less significant topics with a predominance of function words.

The training with the chosen parameters yielded 276 topics.

We chose top2vec over BERTopic [18], due to the unavailability of a pretrained BERT model compatible with the specifics of our data, in terms of language and writing conventions, e.g., abbreviations (see next subsection).

D.2. Experiments with BERTopic

BERTopic was another option that shared many of the strengths of top2vec and performs especially “well on most aspects of the topic modeling domain” [15, p. 12]. BERTopic can run on any pretrained BERT model but is commonly associated with a multilingual pre-trained embedding model trained on Reddit and StackExchange, paraphrase-multilingual-MiniLM-L12-v2. Preliminary tests showed that the model is affected by historical drift, due to the increasing distance between older states of written French and the contemporary standard: BERTopic did run correctly on a set of 17th century French novels and, to a lesser extent, on a large sample of 15th century texts from our corpus. Before the 15th century, the results were totally inconclusive, with one topic containing nearly all the corpus. The semantic map in fig. 10 suggests that sentence embeddings have deteriorated to such an extent that it is no longer possible to recover regular clusters of topics.

We plan to pursue experiments with this method, to be able to compare its results with those presented here, once a pre-trained embedding model fitting our data is made available or is trained by us. Indeed, there exists an Old French BERT, BERTrade [17], but it is based on a corpus significantly smaller than the data we gathered (10 million words, as opposed to 40 million), and more importantly uses text editions with a higher level of normalisation (abbreviations expanded, in particular).

This calls for a methodological remark: the rapid development of masked language models following BERT has created a new range of issues for historical studies. Most models are trained on very recent corpora and data. They are unlikely to cover past linguistic forms and writing conventions, and with pre-trained models alone it would not yet have been possible to conduct this study.

D.3. Results of top2vec

The six topics that scored highest for semantic similarity with the vector of ‘love’ word forms are shown in fig. 11. Interestingly, the two highest scoring topics both relate to the lyrical register of the complaint (plainte) for the pains and injuries caused by love and the act of (metaphorically) dying of love / being killed by love, as they are found abundantly in our corpus in the Tristan en prose and its many lyrical poetry inserts. Death is found again in the fifth topic, which concerns songs of love in general, but also very particularly the Lais of the Tristan en prose, especially the lay mortel (Deadly lai), the last love song sung just before dying of that same sentiment.

In apparent strong contrast, the third topic appears dedicated to the pleasures of love, its sweetness and the comfort it brings, though it can be closely intertwined with the previous one, as is demonstrated in one of the highest scoring passages, taken from a lyrical (and possibly parodic) part of the Romans de Fauvel, of which we give here an excerpt with minor corrections to the HTR:

Q ue ie muir par tres bien amer
E n ce que urai martir serai
D ame en mourant me reconforte

[My lady please remember that] I am dying because of loving very well, that I will be a true martyr of love, lady, this brings me comfort while I die.

Other passages where this topic is most represented are found in a variety of sources, from Amadas et Ydoine to the Roman des Sept Sages de Rome (Seven Wise Masters) and its continuations.

Finally, the fourth and sixth topics are related to specific works, the Romans d’Eneas and a somewhat less known and perhaps overlooked work, Blancandin et l’Orgueilleuse d’Amour.

Topics most related to violence concern, unsurprisingly, descriptions of battles and fights between a knight and his enemies. They are relatively straightforward to interpret and concern different but related aspects of knightly sword- and spear-fighting (fig. 12).

[1] Angelov. “Top2vec: Distributed representations of topics”. In: arXiv preprint arXiv:2008.09470 (2020).

[2] Baumard. “The Ancient Literary Fictions Values Survey”. In: OSF (2021). url: https://osf.io/mvybs.

[3] Baumard, Huillery, A. Hyafil, and L. Safra. “The cultural evolution of love in literary history”. In: Nature Human Behaviour 6.4 (2022), pp. 506-522. doi: 10.1038/s41562-022-01292-z.

[4] Bibliothèque nationale de France. Catalogue BnF Archives et manuscrits. Paris, 2023. url: https://archivesetmanuscrits.bnf.fr/.

[5] Bolt and J. L. Van Zanden. “Maddison style estimates of the evolution of the world economy. A new 2020 update”. In: Maddison-Project Working Paper WP-15, University of Groningen (2020).

[6] Brun, ed. Arlima - Archives de littérature du Moyen Âge. Ottawa, 2005. url: https://www.arlima.net/.

[7] J.-B. Camps, S. Gabay, and G. F. Riva. “Open Stemmata: A Digital Collection of Textual Genealogies”. In: EADH2021: Interdisciplinary Perspectives on Data, 2nd International Conference of the European Association for Digital Humanities. Krasnoyarsk, 2021. url: https://halshs.archives-ouvertes.fr/halshs-03260086.

[8] J.-B. Camps and Randon-Furling. “Lost Manuscripts and Extinct Texts: A Dynamic Model of Cultural Transmission”. In: Proceedings of the Computational Humanities Research Conference 2022, Antwerp, Belgium, December 12-14, 2022. CEUR Workshop Proceedings. 2022, pp. 198-214. url: https://ceur-ws.org/Vol-3290/long%5C%5Fpaper3261.pdf.

[9] Clérice. “Ground-truth Free Evaluation of HTR on Old French and Latin Medieval Literary Manuscripts”. In: Proceedings of the Computational Humanities Research Conference 2022, Antwerp, Belgium, December 12-14, 2022. Ed. by Karsdorp, Lassche, and Nielbo. Vol. 1613. CEUR Workshop Proceedings. Antwerp, 2022, pp. 1-24. url: https://ceur-ws.org/Vol-3290/long%5C%5Fpaper2081.pdf.

[10] Clérice. “You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine”. In: arXiv preprint arXiv:2207.11230 (2022).

[11] Clérice, Pinche, and Vlachou-Efstathiou. “Generic CREMMA Model for Medieval Manuscripts (Latin and Old French), 8-15th century”. In: (2023). doi: 10.5281/zenodo.7631619.

[12] Corbellari. “Retour sur l'amour courtois”. In: Cahiers de recherches médiévales - Journal of medieval studies 17 (2009), pp. 375-385. doi: 10.4000/crm.11542.

[13] S. A. Curiskis, Drake, T. R. Osborn, and P. J. Kennedy. “An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit”. In: Information Processing & Management 57.2 (2020), p. 102034.

[14] Duby. Mâle Moyen Âge: de l'amour et autres essais. Paris, France: Flammarion, 1987.

[15] Egger and Yu. “A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts”. In: Frontiers in sociology 7 (2022), p. 886498.

[16] Gabay, J.-B. Camps, A. Pinche, and C. Jahan. “SegmOnto: common vocabulary and practices for analysing the layout of manuscripts (and more)”. In: 1st International Workshop on Computational Paleography (IWCP ICDAR 2021). 2021.

[17] Grobol, Regnault, P. O. Suarez, Sagot, Romary, and Crabbé. “BERTrade: Using Contextual Embeddings to Parse Old French”. In: 13th Language Resources and Evaluation Conference. 2022.

[18] Grootendorst. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure”. In: arXiv preprint arXiv:2203.05794 (2022).

[19] L. Ing. “L'obsolescence lexicale en français médiéval: Philologie et linguistique computationnelles sur le Lancelot en prose”. PhD thesis. Université Paris Sciences et Lettres, 2023. url: https://www.theses.fr/s221114.

[20] IRHT. Jonas: Répertoire des textes et des manuscrits médiévaux d'oc et d'oïl. Paris et Orléans, 2023. url: http://jonas.irht.cnrs.fr/.

[21] Karas, Qu, Xu, and Zhu. “Experiments with LDA and Top2Vec for embedded topic discovery on social media data-A case study of cystic fibrosis”. In: Frontiers in Artificial Intelligence 5 (2022), p. 948313.

[22] Kestemont, Karsdorp, E. de Bruijn, Driscoll, K. A. Kapitan, P. Ó Macháin, Sawyer, Sleiderink, and Chao. “Forgotten books: The application of unseen species models to the survival of culture”. In: Science 375.6582 (2022), pp. 765-769. doi: 10.1126/science.abl7655.

[23] Kiessling. “Kraken - an Universal Text Recognizer for the Humanities”. In: Digital Humanities Conference 2019, Complexities, Utrecht (DH2019). 2019. url: https://web.archive.org/web/20210719115330/https://dev.clariah.nl/files/dh2019/boa/0673.html.

[24] J. H. Lau and T. Baldwin. “An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation”. In: Proceedings of the 1st Workshop on Representation Learning for NLP. 2016, pp. 78-86.

[25] Le and Mikolov. “Distributed representations of sentences and documents”. In: International conference on machine learning. PMLR. 2014, pp. 1188-1196.

[26] Ma, Zeng-Treitler, and S. J. Nelson. “Use of two topic modeling methods to investigate covid vaccine hesitancy”. In: Int. Conf. ICT Soc. Hum. Beings. Vol. 384. 2021, pp. 221-226.

[27] Mikolov, Chen, G. Corrado, and Dean. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013).

[28] Mikolov, I. Sutskever, Chen, G. S. Corrado, and Dean. “Distributed representations of words and phrases and their compositionality”. In: Advances in neural information processing systems 26 (2013).

[29] Morris. The discovery of the individual, 1050-1200. Vol. 5. Toronto: University of Toronto Press, 1987.

[30] Pinche, ed. Guide de transcription pour les manuscrits du Xe au XVe siècle. Paris, 2022. url: https://hal.science/hal-03697382/.

[31] Pinche, Christensen, and Gabay. “Between automatic and manual encoding”. In: TEI 2022 conference: Text as data. 2022.

[32] L. Ridolfi. “The French economy in the longue durée: A study on real wages, working days and economic performance from Louis IX to the Revolution (1250-1789)”. PhD thesis. IMT School for Advanced Studies, Lucca, 2016. url: http://e-theses.imtlucca.it/211/1/Ridolfi%5C%5Fphdthesis.pdf.

[33] D. de Rougemont. L'Amour et l'Occident. Republ. Online in Rougemont 2.0 (Genève). Paris, 1939. url: https://www.unige.ch/rougemont/livres/ddr1939.ao

[34] Ruby. “Thibaut”. In: Dictionnaire des lettres françaises: Le Moyen Âge. Paris, 1992, pp. 1422-1423.

[35] Seignobos. “L'Amour est-il une invention moderne ?” In: Le Quotidien (1925).

[36] J. Van Zundert, M. Koolen, J. Neugarten, P. Boot, W. Van Hage, and Mussmann. “What Do We Talk About When We Talk About Topic?” In: Proceedings of the Computational Humanities Research Conference 2022, Antwerp, Belgium, December 12-14, 2022. CEUR Workshop Proceedings. Antwerp, 2022, pp. 398-410. url: https://ceur-ws.org/Vol-3290/short%5C%5Fpaper5533.pdf.