<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the &quot;Lancelot en prose&quot; (Medieval French, Castilian, Italian)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Matthias</forename><surname>Gille Levenson</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Centre Jean Mabillon</orgName>
								<orgName type="department" key="dep2">École nationale des chartes</orgName>
								<orgName type="institution">Paris Sciences &amp; Lettres</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory" key="lab1">CIHAM</orgName>
								<orgName type="laboratory" key="lab2">UMR 5648</orgName>
								<orgName type="institution">École Normale Supérieure de Lyon</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">ÉquipEx Biblissima+</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucence</forename><surname>Ing</surname></persName>
							<email>lucence.ing@chartes.psl.eu</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Centre Jean Mabillon</orgName>
								<orgName type="department" key="dep2">École nationale des chartes</orgName>
								<orgName type="institution">Paris Sciences &amp; Lettres</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">ÉquipEx Biblissima+</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jean-Baptiste</forename><surname>Camps</surname></persName>
							<email>jean-baptiste.camps@chartes.psl.eu</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Centre Jean Mabillon</orgName>
								<orgName type="department" key="dep2">École nationale des chartes</orgName>
								<orgName type="institution">Paris Sciences &amp; Lettres</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">ÉquipEx Biblissima+</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the &quot;Lancelot en prose&quot; (Medieval French, Castilian, Italian)</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">76E9E9381D34698F49CD0678BEFE8387</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multilingual alignment</term>
					<term>Text segmentation</term>
					<term>Medieval Arthurian literature</term>
					<term>Stemmatology</term>
					<term>ORCID: 0000-0001-9488-5986 (M. Gille Levenson)</term>
					<term>ORCID: 0000-0002-8742-3000 (L. Ing)</term>
					<term>ORCID: 0000-0003-0385-7037 (J. Camps)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This study focuses on the problem of multilingual medieval text alignment, which presents specific challenges, due to the absence of modern punctuation in the texts and the non-standard forms of medieval languages. In order to perform the alignment of several witnesses from the multilingual tradition of the prose Lancelot, we first develop an automatic text segmenter based on BERT and then align the produced segments using Bertalign. This alignment is then used to produce stemmatological hypotheses, using phylogenetic methods. The aligned sequences are clustered independently by two human annotators and a clustering algorithm (DBScan), and the resulting variant tables submitted to maximum parsimony analysis, in order to produce trees. The trees are then compared and discussed in light of philological knowledge. Results tend to show that automatically clustered sequences can provide results comparable to those of human annotation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The production and transmission of written texts during Antiquity and the Middle Ages involved a process of manual copying. During this process, the text was progressively transformed by errors and innovations as it circulated, each copy introducing successive modifications. Since the 19th century, philologists have studied the transmission of texts on the basis of these innovations, using the genealogical tree (stemma codicum) as a metaphor to represent this transmission process visually. Yet the study of textual transmission is still often limited to the copies produced in a given language (e.g. Medieval French) and does not necessarily encompass all the translations into other medieval languages that were part of the history of a given work. This is due in part to stark difficulties in aligning and analysing multilingual traditions, but</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">The multilingual tradition of the Lancelot</head><p>The study is based on a few witnesses of a complex medieval tradition, that of the Lancelot en prose, an anonymous text composed in the first third of the 13th century which enjoyed great success throughout the medieval period, with at least 126 manuscript witnesses, followed by many printed editions from 1488 onwards <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b3">4]</ref>. This great success is also evidenced by the translations it has undergone (into Castilian, Catalan, Italian, High German, and Dutch). Aligning the texts in a multilingual framework makes it possible, through the recording of variants, to establish the tradition of these translations.</p><p>In this paper, we focus solely on the Romance tradition, in particular the French source and the translations into Castilian and Italian (excluding the Catalan translation, of which only small fragments remain). For each of these two translations, only one complete witness is preserved: for the Italian Lancellotto, a manuscript produced in Florence in the last quarter of the 14th century (Firenze, Biblioteca della Fondazione Ezio Franceschini 1); for the Castilian Lanzarote, a 16th-century manuscript, copied from a 1414 exemplar according to its colophon (Madrid, Biblioteca Nacional de España, 9611). Both texts have been edited <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>Given the extremely large number of witnesses of the French Lancelot, we have selected a sample of five witnesses. They have been chosen on the basis of their supposed relationships with the translations, according to existing philological knowledge: Paris, BnF, fr. 111 (15th c.) is supposed to be close to the Lancellotto <ref type="bibr" target="#b5">[6]</ref>, while Paris, BnF, fr. 751 (13th c., 2nd half), or more precisely the family of which it is a part, would be close to the Lanzarote <ref type="bibr" target="#b8">[9]</ref>. In addition, because they are easily available and offer well-known versions, we added several reference points, in the form of the edition by Sommer, based on ms. London, BL, Add. 10293 (beg. 14th c.) <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>, and the edition by Micha, based on mss Cambridge, Corpus Christi College, 45 (13th c., 2nd half) for volume 2 and Oxford, Bodleian Library, Rawlinson D. 899 (14th c.) for volume 4 <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. Finally, as a representative of the late French tradition, whose place in the genealogy remains to be elucidated, we included the incunable edition of 1488 (from exemplar Paris, BnF, RES-Y2-46 and RES-Y2-47; Rouen, Jean le Bourgois et Paris, Jean Dupré, 1488). For a list of the witnesses, see Appendix A.</p><p>The Lancelot is a very long prose text, and it can therefore be highly unstable from one witness to another. The witnesses containing the translations are not spared from this instability, and they are also fragmentary. The Lancellotto is especially so, as it presents only three nonconsecutive episodes of the text. To enable alignment, the first step was therefore to identify corresponding passages from one witness to another (Appendix B), and to retain only what could be compared. This is why the studied text segments have the identifiers ii-48, ii-61, and iv-75, corresponding to the sections of the text as they appear in Micha's edition.<ref type="foot" target="#foot_1">2</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">State of the art on sentence alignment</head><p>While multilingual alignment is a fairly active field in NLP, relatively little work has been done on the production and use of alignment methods for philological, stemmatological and ecdotic purposes, in the field of heritage texts in pre-orthographic languages.</p><p>Birnbaum and Eckhoff <ref type="bibr" target="#b2">[3]</ref> are considering the creation of a multilingual alignment tool, based on parts of speech only, which works in cases of literal translations, such as the Old Church Slavonic version and an original Greek version of the Codex Suprasliensis. The work of Meinecke, Wrisley, and Jänicke <ref type="bibr" target="#b15">[16]</ref> on French epic literature explores the possibilities offered by semantic representations of words through embeddings, from a monolingual perspective only. However, nothing is said about the definition and identification of "sentences" in the source texts, as the unit chosen to compare the versions seems to be the verse. Yet embedding-based similarity calculations seem to be a promising avenue for text alignment <ref type="bibr" target="#b26">[27]</ref>.</p><p>Recently, Liu and Zhu <ref type="bibr" target="#b14">[15]</ref> have published a tool called Bertalign, a two-step algorithm that makes use of sentence-transformers and multilingual sentence embeddings, based on the LaBSE model ("Language agnostic BERT Sentence Embeddings") <ref type="bibr" target="#b0">[1]</ref>. In particular, it can align 1-to-𝑛 and 𝑛-to-1 text fragments. Bertalign works with segments fixed upstream and is designed for modern language states.<ref type="foot" target="#foot_2">3</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>In this paper, we use the fixed fragments method designed by Liu and Zhu and we make use of Bertalign to perform the global alignments, while at the same time proposing a specific segmentation approach prior to the alignment.<ref type="foot" target="#foot_3">4</ref> We describe our processing chain in four steps: segmentation, pseudo-sentence alignment, classification into distinct variants, and stemmatological analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Segmentation</head><p>The alignment task requires fixed segments, as the tool we rely on, Bertalign, is designed for contemporary languages for which sentence segmentation is not a problem (the period ". " is enough to split the corpus, as modern translations tend to reproduce the same divisions sentence by sentence). It can be an issue for medieval languages, where the notion of a modern sentence, which begins with a capital letter and ends with a period, doesn't really exist. Punctuation is highly variable depending on the copyist, even if a global and comparative study has yet to be carried out. Let's take the following two sentences as examples:</p><p>BnF fr. 111 : ains sem part si tost quil leut commandee a dieu et cheuauche en telle maniere iucqua tierce tant quil uient a lissue de la fourest. et il tourne a destre vers ung chemin uieil et ancien.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>BnF fr. 751: Ains sen part si tost comme il lot comandee a dieu. et cheuauche en tel maniere. tant quil uient a lissue de la forest et il torne a destre uers le chemin uiez et ancien</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Translation (for both passages): [But he goes away as soon as he commended her to God and rides [until 9 a.m.], until he comes to the entrance of the forest, and he turns right towards an old pathway.]</p><p>The punctuation of these two transcriptions is original: it shows that punctuation is not a reliable marker for syntactic segmentation, as it can vary greatly depending on the source, and this variability increases during the transcription and editing of the text. Moreover, original punctuation is still rarely taken into account by editors, who re-punctuate the text and can make highly variable choices. It is therefore necessary to divide the texts into equivalent segments that can then be compared and aligned. To do this, we have decided to base our approach on sentence syntax. The assumption here is that there is a certain degree of correspondence between syntactic and semantic units, and that syntactic tokenisation should be an efficient way to approach the different semantic units of the text. Let us take an example (Table <ref type="table" target="#tab_0">1</ref>): in the second line, "si" is translated by "e" in Castilian, but the syntactic units are preserved, allowing the alignment of segments that are formally quite different but share the same overall meaning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">Segmentation methods</head><p>To segment the texts, we try, compare and evaluate two methods: the first uses regular expressions, the second one uses AutoModelForTokenClassification BERT models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Using regular expressions</head><p>The first method we tried was based on the identification of function words, which required a language-by-language analysis of syntactic delimiters. Regular expressions were developed to facilitate this identification and to handle the high degree of graphical variation. Here are a few examples of such markers in French: In Spanish:</p><p>"[Pp]ues que", "[pP]or?que", "[pP]er", "[Qq]ue", "a que", "si", "[Dd]o", "[Ee]", "commo quier que", "mas", "ass?i como", "como", "[Aa]n?ss?i", "para", "aquel que"</p><p>And in Italian:</p><p>"[mM]a si", "[Ss]i", "tanto che", "che", "e", "donde", "ch'?"</p><p>This regular expression method is limited, however, as it does not allow for fine segmentation in the case of repeated markers (all possible combinations have to be anticipated), which leads to over-segmentation of the text, as we will see below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Creating an ad hoc segmenter</head><p>Another problem with rule-based segmentation is that it relies solely on the identification of character strings. Besides the fact that this leads to a proliferation of rules for languages with significant graphical variation (ainsi, insi, einsi for thus, for instance), it also identifies tokens that do not serve to delimit coherent syntactic units. For example, the coordinating conjunction et (and) is a good delimiter of clauses and sentences, but it also serves to coordinate elements within the same syntactic unit.</p><p>Let us have a look at the following two sentences:</p><p>This is why the use of language models for identifying delimiters seemed relevant. Indeed, these models make it possible to capture the semantics of different tokens, and thus to differentiate the uses of the same word. It was decided to use AutoModelForTokenClassification BERT models <ref type="bibr" target="#b27">[28]</ref>, with the aim of identifying tokens that can serve as delimiters of these textual units. These models are usually trained on contemporary languages, but, as we will see, the large amount of data used to train them allows for good generalization and good results on our ancient language states.<ref type="foot" target="#foot_4">5</ref> To produce specific training data for our corpus, tokens identified as delimiters were labeled with the value 1 and all other tokens with the value 0, based on human evaluation and a specific set of grammatical rules.<ref type="foot" target="#foot_5">6</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">Segmenter training corpus</head><p>The models were trained on three languages: French, Castilian and Italian, in their medieval forms. The medieval Castilian corpus comprises around 8,000 tokenised lines (about 36,200 words), half of which are various texts from the 13th to 15th centuries, retrieved from the CORDE linguistic database <ref type="bibr" target="#b18">[19]</ref>. The other half of the corpus consists of fragments taken from the Lanzarote, which are not included in the corpus dealt with in this paper. The French and Italian models were trained solely on data from the Lancelot, which is debatable from a model generalization perspective but was motivated by the desire to quickly achieve effective models for our texts. They were trained on 12,700 and 28,000 words (1,000 and 1,500 lines), respectively. The difference in the number of words required to produce convincing models is due to the size of the initially used BERT models (the French BERT model is much larger as it was trained on a much larger dataset, making it more efficient for the given task).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.">Results</head><p>The results of the best segmentation models are presented in Table <ref type="table" target="#tab_3">2</ref>, Table <ref type="table" target="#tab_4">3</ref> and Table 4, one for each language, according to the classic measures of accuracy, precision, recall, and F1 score.<ref type="foot" target="#foot_6">7</ref> Each table displays the results of a regex-based model and of a BERT-based model. The results show that the BERT-based model substantially improves text segmentation compared to simple regexes.</p><p>The best model for each language is chosen on a test corpus that has never been used for training. Given the whole workflow, recall is the important metric: it is more important to identify as many true delimiters as possible than to be precise in identifying them. This is because the alignment phase that follows can merge alignment units, thereby compensating for false positives. Conversely, false negatives (failure to identify a delimiter) will not be compensated for later on, and will lead to an alignment of poorer quality. The best model is therefore selected using a weighted average of recall and precision, with a weight of 2 for recall and 1 for precision. Accuracy is high because it computes the total number of correct label assignments, both for labels 0 (non-delimiter tokens) and 1 (delimiter tokens), across the texts.</p><p>It is the second label, "Delimiter", corresponding to the results of identifying tokens labeled 1, that is relevant. Between the F1 scores of the regex and BERT-based models, we observe a significant improvement, with an increase of over 0.20 for French, 0.24 for Italian, and 0.23 for Spanish. Our models thus successfully segment the text according to a syntactic (and semantic) logic.<ref type="foot" target="#foot_7">8</ref> The following short example shows the segmentation produced by both methods:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>BERT-based: et bien paroit a ses iauz [SEP] qu'ele avoit rouges et anflez [SEP] qu'ele eust ploré</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Regex: et bien paroit a ses iauz qu'ele avoit rouges [SEP] et anflez qu'ele eust ploré</p><p>The regex segmentation cuts the sentence once, in the middle of a phrase ("rouges et anflez", that is, "red and swollen"), whereas the BERT-based segmentation correctly segments the sentence twice, in accordance with the grammatical structure, at the relative pronoun and at the subordinating conjunction (qu'). The method chosen also has an influence on segment size, which is not shown in the evaluation above and which has an impact on the quality of the alignments produced, in favor of the BERT-based method, as we will see below.</p></div>
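<div xmlns="http://www.tei-c.org/ns/1.0"><p>The model-selection criterion described in the Results section (recall weighted 2, precision weighted 1) can be sketched as follows. This is a minimal illustration only: it assumes that the "weighted average" is a weighted arithmetic mean, and the function name, the candidate models and their scores are hypothetical, not taken from the experiments.</p>

```python
def selection_score(precision, recall, w_precision=1.0, w_recall=2.0):
    """Weighted arithmetic mean of precision and recall (assumed form)."""
    total = w_precision + w_recall
    return (w_precision * precision + w_recall * recall) / total


# Hypothetical candidate models; the scores below are invented for illustration.
candidates = {
    "model_a": {"precision": 0.95, "recall": 0.85},
    "model_b": {"precision": 0.88, "recall": 0.93},
}

# The model with the highest recall-oriented score is retained.
best = max(candidates, key=lambda name: selection_score(**candidates[name]))
print(best)
```

<p>With these invented figures, the more precise model is not retained, because its lower recall weighs twice as much in the score.</p></div>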
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Alignment</head><p>Once the BERT-based models are trained and selected, they are integrated into the alignment workflow, where they segment the text by recognising the correct delimiters before the actual alignment phase. As stated above, Bertalign is used to perform the alignments, with the following parameters: max-align=3, window=5, skip=-0.2, margin=True, len_penality=True. The alignment is carried out on the basis of a main (pivot) witness chosen in advance on the basis of philological knowledge, that is, Micha <ref type="bibr" target="#b16">[17]</ref>. This pivot witness is compared with each of the others, and the alignments are merged. To create the merged alignment table, we use a graph method to connect each pair of aligned segments and create the final alignment unit, while preserving the one-to-many and many-to-one alignments. Considering each aligned pair as connected nodes in a graph, it is possible to build up complete alignment units by connecting all nodes together thanks to the pivot fragment<ref type="foot" target="#foot_8">10</ref> (figures 1 and 2).</p></div>
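<div xmlns="http://www.tei-c.org/ns/1.0"><p>The graph-based merging of pairwise alignments can be sketched as a connected-components computation. The sketch below is only one possible reading of the procedure: the input format (for each witness, a list of links pairing pivot segment indices with witness segment indices), the function name and the toy data are assumptions made for illustration, and the node name "pivot" is reserved for the pivot witness.</p>

```python
from collections import defaultdict


def merge_alignments(pairwise):
    """Merge pivot-to-witness links into complete alignment units.

    pairwise maps a witness name to a list of (pivot_ids, witness_ids) links.
    Every segment is a node; the nodes of one link are connected, and each
    connected component of the resulting graph is one alignment unit.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for witness, links in pairwise.items():
        for pivot_ids, witness_ids in links:
            nodes = [("pivot", i) for i in pivot_ids]
            nodes += [(witness, j) for j in witness_ids]
            if not nodes:
                continue
            find(nodes[0])  # registers the node even when it stands alone
            for node in nodes[1:]:
                union(nodes[0], node)

    units = defaultdict(lambda: defaultdict(list))
    for name, index in list(parent):
        units[find((name, index))][name].append(index)

    merged = [{name: sorted(ids) for name, ids in unit.items()}
              for unit in units.values()]
    # Order the units by their first pivot segment (units without pivot go last).
    return sorted(merged, key=lambda unit: min(unit.get("pivot", [float("inf")])))


# Toy example with two witnesses aligned against the pivot.
toy = {
    "A": [([0], [0]), ([1, 2], [1])],
    "B": [([0], [0, 1]), ([1], [2]), ([2], [3])],
}
for unit in merge_alignments(toy):
    print(unit)
```

<p>In this toy example, pivot segments 1 and 2 end up in the same unit because witness A aligns them jointly to a single segment, which in turn pulls both corresponding segments of witness B into that unit.</p></div>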
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">Alignment evaluation</head><p>To assess the impact of the chosen segmentation method on alignment, we conducted an evaluation of the alignment produced by each of the two methods (regex and BERT-based segmentation).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description of alignment results</head><p>The evaluation is based on correcting the alignments obtained (correcting the alignment table of indices). Given that segmentation varies between different segmenters, the correction process must be repeated for results from both BERT-based segmentation and regex segmentation. Table <ref type="table">5</ref> shows some alignment results with the fragment iv-75-1 (see also Appendix D, Table <ref type="table" target="#tab_10">11</ref>). These results present the same alignment error in the Lancellotto and BnF fr. 111, but due to two different causes: in the case of the Lancellotto, the error can be attributed to the segmentation phase, where the segmenter saw a single unit and did not separate "consiglio" from "chi". In the case of BnF fr. 111, segmentation was done properly, but the error was produced in the alignment phase, where the two units were regrouped.</p><p>Evaluation. Alignments are evaluated on the basis of the Alignment Error Rate (AER) [25, 2.6, p. 21]:</p><formula xml:id="formula_0">\mathrm{AER} = 1 - \frac{2 \times |L \cap L_{\mathrm{gold}}|}{|L| + |L_{\mathrm{gold}}|}</formula><p>but it concerns a small number of alignment units only.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Example of alignment automatically produced, with regex segmentation on the segment iv-75-1 (small extract). The meaning of the passage is "…that I would remedy it as I can". We can see an alignment error with witness BnF fr. 111.</p><p>In the AER formula, L and L_gold denote the sets of predicted and hand-corrected links, respectively. The evaluation is conducted on indices by comparing automatically generated indices with corrected ones. As the alignment is produced for each witness compared to the pivot witness, an AER is produced for each pair, and Table <ref type="table">6</ref> presents the average of these values in order to give an overall value for each evaluated segment of text.</p><p>We distinguish between two types of results: results that consider incorrect alignments due to segmentation issues, and those that do not. The evaluation is performed on different parts of the text, each with varying numbers of alignment units.</p></div>
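<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked illustration of the Alignment Error Rate, the sketch below computes it for one pivot/witness pair. The representation of a link as a pair of index tuples (pivot indices, witness indices) is an assumption made for the example, as are the function name and the toy data; only the formula itself comes from the text.</p>

```python
def alignment_error_rate(predicted, gold):
    """AER = 1 - 2 * |L inter L_gold| / (|L| + |L_gold|)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 0.0  # nothing to align and nothing predicted: no error
    shared = len(predicted.intersection(gold))
    return 1 - 2 * shared / (len(predicted) + len(gold))


# Toy links: ((pivot indices), (witness indices)).
predicted_links = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
corrected_links = [((0,), (0,)), ((1,), (1,)), ((2,), (2, 3))]

# Only the first link is shared by both sets: AER = 1 - 2 * 1 / (3 + 3).
print(round(alignment_error_rate(predicted_links, corrected_links), 4))
```

<p>An overall value for a section can then be obtained, as described above, by averaging the rates obtained for each witness against the pivot.</p></div>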
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Alignment Error Rate with LaBSE, excluding or including segments that were not properly segmented. The numbers of evaluated segments are in the "Nb of segm." columns. Numbers in bold are detailed infra (improvement of the results with BERT-based segmentation for the ii-48 and ii-61-1 sections but not for the iv-75-1 one).</p><p>We observe a systematic improvement between the results that consider erroneous segmentation and those that do not, both for regex segmentation and BERT-based segmentation. Several sections of text were evaluated using both methods, revealing a slight overall improvement in results with BERT-based segmentation.</p><p>It is important to note that the quality of alignment varies depending on the textual proximity among versions in each witness. For example, there is a gap (0.15) between the results for sections ii-48 and ii-61-1, which can be attributed to varying lacunas and omissions observed in several witnesses for the latter.<ref type="foot" target="#foot_9">11</ref> Moreover, the poor results observed for the iv-75-1 section (better results are observed for regex segmentation than for BERT-based segmentation) can be attributed to numerous instances of missing correspondence among several witnesses, reflecting divergent traditions with frequent omissions. In this context, regex segmentation, generating fewer segments (669 compared to 797 in this case), yields better alignment. This is because these segments consolidate more units, potentially accommodating omissions that might affect other segments. In contrast, BERT-based segmentation, which offers finer granularity, tends to make more errors.</p><p>There is indeed a notable difference in alignment results due to the size of the produced data: regex segmentation generates fewer segments, i.e., aligned units, as illustrated in Figure <ref type="figure" target="#fig_1">3</ref>. BERT-based segmentation results in a significantly higher number of segments containing 3 to 14 tokens (and even up to 21 tokens), which correlates with the higher total number of segments obtained with this method. Conversely, regex segmentation produces more segments with a higher number of tokens (more than 30), as well as segments with only one or two tokens. These segments are either too large to facilitate accurate alignment or too small to maintain semantic coherence. Thus, alignment tends to be of better quality with the BERT-based method for stemmatological purposes.</p><p>Correlatively, the size of the units (the units produced by the segmentation) within the segments changes, as we can see from Figure <ref type="figure" target="#fig_1">3</ref>. With the regex segmentation, there is a high number of very small units containing only one, two, or three tokens. In contrast, there are more units containing 5 to 8 tokens with BERT-based segmentation. As the units in regex segmentation are very short, they are grouped easily into segments. While this yields better results for the highly variable parts of text described above, this approach results in a generally poor alignment, whereas BERT-based segmentation produces fewer segments with more semantically consistent units. Although the AER results are not very promising, we can assume that the BERT-based segmentation produces a more coherent segmentation that positively impacts the alignment task. It is important to note that AER is a good metric for evaluating the quality of an alignment, but that it is not sufficient. Indeed, a text can be split into only three parts and have a low AER, if the splits are correct, whereas this kind of result would not allow for any meaningful exploitation of the data. By contrast, the evaluation of the distribution of unit and segment lengths shows that BERT-based segmentation provides more coherent elements to align.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Classification in variants</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">Method</head><p>Most stemmatological and phylogenetic algorithms need features (i.e. variants) that are arbitrarily coded (usually, using letters or numbers), e.g. variant 1, variant 2, etc. <ref type="bibr" target="#b6">[7]</ref> In our case, this requires clustering, for each alignment unit, the readings of all witnesses into relevant clusters (Table <ref type="table">7</ref>). The number of these clusters can vary from 1 (all witnesses have the same variant) to a maximum equal to the total number of witnesses (each witness bears a different variant).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 7</head><p>Table showing an aligned portion of text for each witness, and its numerically coded classification in variants by both annotators (here in agreement) and a clustering algorithm. The main opposition here is between "heard" (oy, udito,…) and "done" (fait), and an omission. An OCR mistake ("or" for "oï ") in part prevents a correct clustering by the algorithm.</p><p>As such, this task is an unsupervised clustering task, where the number of desired clusters cannot be known in advance, which excludes most clustering methods (such as, for instance, k-means or k-medoids). For this reason, we choose an unsupervised density-based clustering method, DBScan (Density-based spatial clustering of applications with noise <ref type="bibr" target="#b10">[11]</ref>), that we apply to the cosine distances between the readings of the witnesses, for each alignment unit, in the embedding.</p><p>DBScan tries to find dense clusters of points, i.e., points with a given minimum number of neighbours (MinPts) situated at a given maximum distance (𝜖). Although DBScan does not require the number of clusters to be set, it does require the two parameters MinPts and 𝜖. These parameters can only be chosen contextually, in relation to the number of individuals (witnesses) and the distances between them. We set MinPts at 1, given the small number of witnesses. For 𝜖, we follow a methodology called DMDBScan, which tries to identify relevant values by first computing the minimum distances to the MinPts nearest points, and then plotting them in ascending order to identify values at which there are sudden sharp changes in the minimum distances, possibly revealing different densities <ref type="bibr" target="#b9">[10]</ref>. Following this methodology, we set 𝜖 at 0.2 (Appendix F). Yet, due to the multilingualism of the corpus and the considerable spelling variation characteristic of medieval languages, some distances are artificially increased not for semantic reasons but because of formal bias. To correct for this bias, a weighting is applied to all distances prior to performing clustering, according to:</p><formula xml:id="formula_1">\mathrm{weightedDist}(a_i, b_i) = \frac{\mathrm{dist}(a_i, b_i)}{\frac{1}{2}\left(\mu(\mathrm{dist}(a, n)) + \mu(\mathrm{dist}(b, n))\right)}</formula><p>where 𝑎 and 𝑏 are two witnesses, and 𝑖 a given alignment unit, and where 𝜇(dist(𝑎, 𝑛)) is the arithmetic mean of the distance between 𝑎 and all other witnesses for all segmentation units.</p></div>
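<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighting and the clustering step can be illustrated with a small standard-library sketch. It relies on a known property of DBSCAN: when MinPts is 1, every point is a core point, so the clusters are simply the connected components of the graph linking points whose distance is at most 𝜖. The sketch assumes that 𝜖 is applied to the weighted distances and that cosine distances are already computed; the function names, the data layout and the toy distances are hypothetical.</p>

```python
from statistics import mean


def symmetric(pairs):
    """Build a symmetric nested dict of distances from unordered pairs."""
    unit = {}
    for (a, b), d in pairs.items():
        unit.setdefault(a, {})[b] = d
        unit.setdefault(b, {})[a] = d
    return unit


def mean_distances(units, witnesses):
    """Mean distance between each witness and all others, over all units."""
    return {a: mean(unit[a][b] for unit in units for b in witnesses if b != a)
            for a in witnesses}


def weighted_dist(unit, mu, a, b):
    """Distance in one unit divided by the average of the two witness means."""
    return unit[a][b] / (0.5 * (mu[a] + mu[b]))


def cluster_unit(unit, mu, witnesses, eps=0.2):
    """DBSCAN with MinPts = 1, i.e. connected components at threshold eps."""
    labels = {}
    next_label = 0
    for start in witnesses:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            a = stack.pop()
            for b in witnesses:
                if b not in labels and eps >= weighted_dist(unit, mu, a, b):
                    labels[b] = next_label
                    stack.append(b)
        next_label += 1
    return labels


# Toy cosine distances for three witnesses over two alignment units.
witnesses = ["A", "B", "C"]
units = [
    symmetric({("A", "B"): 0.02, ("A", "C"): 0.30, ("B", "C"): 0.32}),
    symmetric({("A", "B"): 0.20, ("A", "C"): 0.22, ("B", "C"): 0.03}),
]
mu = mean_distances(units, witnesses)
for unit in units:
    print(cluster_unit(unit, mu, witnesses))
```

<p>In the first toy unit, A and B receive the same variant code and C a different one; in the second, B and C are grouped against A. Each dictionary of labels corresponds to one row of a variant table.</p></div>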
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2.">Results and evaluation</head><p>In order to evaluate the results of clustering, we compare the clustering performed by DBScan to that of the two human annotators (the first two authors of this paper), and compute the mean Adjusted Rand Index (ARI). Its value is at most 1: 1 denotes complete agreement, 0 a case where both clusterings appear to have been labelled independently at random, and negative values an agreement lower than expected by chance. We report the mean ARI between the two human annotators, and between each annotator and the DBScan result. The results are presented in Table <ref type="table" target="#tab_8">8</ref>. The score is very low for the DBScan results (0.04), especially compared to the correspondence between the human annotators (0.70). Yet, this does not necessarily mean that the clustering results yield no significant genealogical information for the stemmatological algorithms to process, as will be seen in the next section.</p></div>
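<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the evaluation measure concrete, the following sketch computes the Adjusted Rand Index for one alignment unit from the pair-counting formula, and averages it over units. It is an illustration only: the function names and the toy labellings are assumptions, and the authors may well have used a library implementation instead.</p>

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labellings of the same witnesses (at least two witnesses)."""
    total_pairs = comb(len(labels_a), 2)
    # Pairs of witnesses grouped together in both labellings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        # Both labellings are trivial and identical (one cluster, or all singletons).
        return 1.0
    return (together_both - expected) / (maximum - expected)


def mean_ari(units_a, units_b):
    """Mean ARI over alignment units, each unit being a pair of labellings."""
    scores = [adjusted_rand_index(a, b) for a, b in zip(units_a, units_b)]
    return sum(scores) / len(scores)


# Toy labellings (assumed) of four witnesses for two alignment units.
annotator = [[1, 1, 2, 2], [1, 1, 2, 2]]
algorithm = [[2, 2, 1, 1], [1, 2, 1, 2]]
print(mean_ari(annotator, algorithm))
```

<p>The first toy unit is a perfect match up to a renaming of the codes (ARI of 1), whereas the second splits every group of the annotator (ARI of -0.5), which gives a mean of 0.25. The index depends only on which witnesses are grouped together, not on the numeric codes themselves.</p></div>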
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results of the stemmatological analysis</head><p>The alignment tables obtained through the BERT-based segmentation and the alignment phase enable philological analysis, which was carried out in two distinct ways. First, a traditional stemmatic analysis, based on human expertise, was performed. Then, the variants resulting from the human annotations and from the DBScan clustering were subjected to stemmatological algorithms, and the results were compared.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Human-identified witness groups</head><p>We first decided to evaluate the links between the witnesses on the basis of the alignment tables and of close reading by human experts (ourselves, and the philologists who previously worked on this topic). Each of the witnesses we chose presents particular readings (subsection E.1). However, we can distinguish two main groups within the tradition: on the one hand, the witnesses Micha, BnF fr. 751, BnF fr. 111 and Lancellotto; on the other hand, Sommer, the incunable, and Lanzarote. These groups can be determined on the basis of omissions or distinct reading groups (subsection E.2). The oppositions between the two groups are quite frequent and stable.</p><p>Nevertheless, Lanzarote shares readings with the first group and sometimes oscillates between the readings of the two groups (subsection E.3). In fact, the text of the Lanzarote presents a quite original version, especially due to the shortened passages it contains (subsection 1.2). It also often offers specific readings and innovations, particularly when the translation uses direct speech instead of plain narration (subsection E.4).</p><p>The text of the Lancellotto does not show such originality. On the contrary, it is very close to the French text, using many gallicisms <ref type="bibr">[5, p. 43-44]</ref>. It also stays very close to its own group. It shares particular readings with BnF fr. 111, as well as several omissions. However, Lancellotto sometimes shares a reading with Micha against BnF fr. 111 (subsection E.5). The Lancellotto also shows some innovations which show that the actual model of the translation is not part of our selection (subsection E.6).</p><p>We can assume that the witnesses we studied belong to two main groups, and that the witnesses that interest us, the Lanzarote and the Lancellotto, each fall into one of these groups.
However, the specific variants and innovations found in each witness do not allow the identification of a specific model or the determination of a more precise affiliation.</p><p>The identification of groups confirms the filiation between BnF fr. 111 and Lancellotto <ref type="bibr" target="#b4">[5]</ref>, but does not confirm the specific link between BnF fr. 751 and Lanzarote. It is important to note that BnF fr. 751 is very unstable: although it shares most of its readings with the first group in the variants we studied, it also has its own unique readings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Results of the stemmatological algorithm</head><p>The second approach applies a well-known phylogenetic algorithm to the tables of variants produced by the two human annotators, and to the one produced by the DBScan clustering. The chosen algorithm is the branch and bound algorithm, whose goal is to determine minimal evolutionary trees by the criterion of maximum parsimony <ref type="bibr" target="#b11">[12]</ref>. This algorithm can produce one or several trees, if different configurations achieve the same parsimony value. The trees resulting from the analyses are presented in Figure <ref type="figure" target="#fig_2">4</ref>.</p><p>The trees resulting from the tables produced by the human annotators present, unsurprisingly, strong similarities with the result of the human close reading expertise on the genealogy. In particular, they all display the groupings {Lanzarote, Sommer, Incunabula} versus {Lancellotto, Micha, fr. 751 and fr. 111}. The second and third alternative trees from Annot. 2 also show the grouping of Lancellotto and fr. 111, already observed by the editor of the Lancellotto <ref type="bibr" target="#b5">[6]</ref>, but only the third tree from Annot. 2 shows the stronger similarity between fr. 751 and Lanzarote that was hypothesised by its editor <ref type="bibr" target="#b8">[9]</ref>.</p><p>In comparison, the results based on DBScan do not fully show the opposition {Lanzarote, Sommer, Incunabula} versus {Lancellotto, Micha, fr. 751 and fr. 111}, because Sommer switches families. Yet, they are more consistent with the analyses of the editors of the Lanzarote, which is positioned close to fr. 751 <ref type="bibr" target="#b8">[9]</ref>, while Lancellotto remains relatively close to fr. 111.</p><p>Although the results obtained from the automatic clustering of variants differ from the expertise of the human annotators, the genealogical results do not necessarily contradict previous expertise, and seem to remain significant up to a point. Whether the human annotators or the clustering provide the most useful data remains to be fully assessed, but current results indicate that, even if the clustering differs from human annotation in its classification of readings (as shown by the mean Adjusted Rand Index), the results obtained through stemmatological analysis remain indicative.</p></div>
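<div xmlns="http://www.tei-c.org/ns/1.0"><p>The parsimony criterion used to rank trees can be illustrated with a short sketch. It does not perform the branch and bound search itself: it only scores given candidate trees with Fitch's counting procedure, which is the quantity such a search minimises. The nested-tuple tree representation, the function names and the two candidate trees are assumptions made for the example; the two variant places are those of subsection E.2 (shield versus weapons, castle versus forest), with an arbitrary coding 1/2.</p>

```python
def fitch(tree, readings):
    """Return (possible ancestral codes, number of changes) for one variant place.

    `tree` is either a witness name or a pair of subtrees;
    `readings` maps each witness name to its numeric variant code.
    """
    if isinstance(tree, str):
        return {readings[tree]}, 0
    left_codes, left_cost = fitch(tree[0], readings)
    right_codes, right_cost = fitch(tree[1], readings)
    shared = left_codes.intersection(right_codes)
    if shared:
        return shared, left_cost + right_cost
    # No code in common: one change is needed on this branching.
    return left_codes.union(right_codes), left_cost + right_cost + 1


def parsimony(tree, table):
    """Total number of changes a tree requires over all variant places."""
    return sum(fitch(tree, readings)[1] for readings in table)


# Variant places from subsection E.2, coded 1 (shield, castle) or 2 (weapons, forest).
shield_or_weapons = {"micha": 1, "sommer": 2, "fr751": 1, "fr111": 1,
                     "inc": 2, "lanzarote": 2, "lancellotto": 1}
castle_or_forest = dict(shield_or_weapons)  # same distribution of readings
table = [shield_or_weapons, castle_or_forest]

# Two hypothetical candidate trees: one respecting the two groups, one mixing them.
grouped = (("micha", ("fr751", ("fr111", "lancellotto"))),
           ("sommer", ("inc", "lanzarote")))
mixed = (("micha", "sommer"),
         (("fr751", "inc"), (("fr111", "lanzarote"), "lancellotto")))
print(parsimony(grouped, table), parsimony(mixed, table))
```

<p>On these two variant places, the tree that keeps the two groups apart needs 2 changes, against 6 for the tree that mixes them, so the parsimony criterion prefers the former. A branch and bound search explores candidate trees and discards partial trees whose score already exceeds the best one found so far.</p></div>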
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>We have designed a tool that can facilitate the study of unilingual and multilingual variant traditions, and their stemmatological analysis. The results obtained through clustering and stemmatological analysis tend to show that an automatic clustering of variants, combined with a phylogenetic algorithm, can quickly produce results that are indicative of the genealogy of witnesses and comparable, to some extent, to traditional human expertise.</p><p>Yet, in the current state, many aspects of the processing workflow could still be improved, especially if the results of the automatic alignment and collation were to be used for scholarly editing purposes.</p><p>As far as segmentation is concerned, the delimiter tokens are those that reliably identify coherent grammatical segments (relative pronouns, coordinating conjunctions between clauses, subordinating conjunctions, etc.). This is a problem for all sentences that begin with personal pronouns, articles, or nouns rather than with the markers listed above. Therefore, we need to produce more segmented data to be able to distinguish all these complex cases.</p><p>As for the alignment step, we show that the LaBSE sentence embedding model works surprisingly well on medieval language states. Still, there is room for improvement, and fine-tuning the LaBSE model on aligned medieval texts should improve the results. Variant clustering should also benefit from this fine-tuning. However, it will be a complex and time-consuming task, because it requires aligned texts as training data: this step is planned for the future. Another important step will be microscopic alignment, which involves further work on modeling the alignment process, because it differs quite significantly from the alignment of monolingual texts, particularly with regard to the variations in word order between the source sentence and the target sentence.
Micro-alignment could also allow more precise results in the stemmatological step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Materials and code availability</head><p>The tool for alignment and collation is freely available on Zenodo (DOI: 10.5281/zenodo.12732533) and is maintained on GitHub: https://github.com/ProMeText/Aquilign/. The result data can be found in the results_dir directory. The training corpus for the segmenter is available in the subfolder data/tokenisation. The data and scripts for clustering and stemmatological analysis are available in a dedicated Zenodo repository (DOI: 10.5281/zenodo.12728282).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Witnesses and segments of text</head><p>The Lancellotto is particularly fragmentary. Only the sections of the text corresponding to these fragments were studied.</p><p>According to Micha, it corresponds to the sections: II, XLVIII 29 -L 11; II, LXI 26 -LXIX 23; IV, LXX 1 -LXXXI 17. For the last of these, however, the Lanzarote omits the text and starts at IV, LXXV, with a very reduced portion in LXXX 16 to 43. Thus, the studied section of text starts at IV, LXXV.</p><p>We indicate here the witnesses and the folios where the segments of the text can be found. Note that, as indicated in the introduction, fr. 333 only contains the third studied segment, and that the text corresponding to the incunabula is spread over two different volumes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lancellotto</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Example of missed section</head><p>As an example of an omitted section of text that can affect the alignment process, we can take the fight in the episode of the sparrowhawk (corresponding to the segment ii-61-2).</p><p>In Micha's edition, the text is the following (the whole episode is given by all the other studied witnesses):</p><p>… tant qu'il vienent en une valee. Et lors li mostre un poi en sus del chemin a senestre la loge dont li chevaliers issi, dont ele se plaint. « Venés avant, fet il, seurement, et se vos veez l'esprevier, si le prenés, ja por nului nel laissiés. Et je vos creant loialment que je le vos garantirai a mon pooir contre tos cels qui le voldront contredire. Et se li espreviers n'i est, si me mostrez le chevalier qui le vos toli et jel vos ferai amender tot a vostre volenté. -Sire, fet ele, de Dieu aiés vos bone aventure ! Mais je voldroie miels que vos le me poissiés rendre a pes que a guerre. -Par Dieu, fet il, se ie ne le puis avoir par debonaireté, si l'avrai je par force. » Lors sont venu a la loge ; si entre mesire Yvain tos premiers et la damoisele aprés. Et mesire Yvain ne salue nul de cels de laiens, ains dist si haut que tuit le porent oïr : « Damoisele, venés avant et si prenés vostre esprevier, se vos saiens le poez veoir ; si l'enportez ausi a droit com il en fu portez a tort. -Sire, fet ele, volentiers, ausi le voi je la. » Et ele vient a une perche ou il seoit, si li deslie les giés let l'en volt porter, quant uns chevaliers saut avant qui li dist : « Fuiés. damoisele ! Ne le remués, que par mon chief vos n'en porterés point. Et de tant com vos i estes retornee, avés vos del tot vos pas perdus, que vos ne l'enporterés n'en l'une main n'en l'autre ; et se vos volés avoir oisel, si querés autre, kar a cestui ne vos deduirés vos jamés.</p><p>-Laissiés li, dans chevaliers, fet mesire Yvain, qu'ele l'enportera, et se vos li volés fere force, vos en serés tart al repentir. -Comment ? 
fet cil. Estes vos ci venus por le deffendre ? -Ce verrois vos bien, fet mesire Yvain, se vos ti tolés. » Et cil giete maintenant le main por le tolir ; et mesire Yvain a trait l'espee et li dist qu'il li coupera le bras, s'il toche plus ne a lui ne a la damoisele. « Voire, fet cil, par mon chief mal le deïstes ! » Lors cort a son hialme et le met en sa teste, et il estoit molt bien armés de tote autre armeure. Maintenant saut en son cheval, kar tos estoit prest, et prent son escu et son glaive et dist a mon seignor Yvain qu'il se gart de lui ; si li laisse corre, son glaive aloigné et mesire Yvain a lui ; si s'entredonent si grans cops sor les escus qu'il les font fendre et percier et les haubers desmaillier et derompre ; si se metent es chars nues les glaives trenchans, si s'entrehurtent des escus et des cors et des visages, si s'entreportent a terre tot enferré. Mesire Yvain est navrés el costé destre et li chevaliers fu ferus par mi le cors si durement qu'il n'a pooir de soi relever de la ou il gist. Et mesire Yvain se redrece a tot le tronçon qui demi li est el costé ; si trait l'espee et s'apareille d'assaillir le chevalier qui le meillor cop li a doné qu'il receust pieça ; et il le cuide tot prest trover de deffendre, si voitqu'il ne se remue, et lors li cort sus et li esrache le hialme de la teste et dist qu'il li coupera le chief sans arest, s'il ne se tient por outré. Et cil parole a grant paine com cil qui molt estoit bleciés, si crie merci et dist : « Ha, frans chevaliers, ne m'ocie mie, mais laisse moi vivretant que j'aie mon Salveor receu, kar je sai bien que je sui navrés a mort ; si vos pri por Dieu que vos alés ci pres desus cest tertre querre un saint home mire qui i maint, et li faites avec lui aporter a corpus Domini. Et il dist que si fera il volentiers ; si commande la damoisele qu'ele s'en aut et ele le fet. Mais ele fet assés greignor duel que devant, kar ele voit .I. chevalier ocis et .I. autre navré, et por si petit d'acheison. 
Et mesire Yvain vet querre l'ermite, ensi com li chevaliers li ot dit, et li amaine. Et quant il fu revenus arrieres, si trueve iluec .I. escuier et une damoisele qui estoit amie al chevalier et faisoit le greignor duel del monde. Et quant li chevaliers fu confés et il ot receu son Salveor, si le coucha l'en en la loge. Et mesire Yvain s'en vet avec l'ermite et enmaine son cheval en destre, kar a cheval n'i alast il mie delés si haut saintuaire comme Nostre Salveor. Quant il furent venu a l'ermitage que l'en apeloit l'ermitage del Mont, si desarment troi frere qui laiens estoient mon seignor Yvain, et il en i avoit un qui molt savoit de plaies garir; si s'entremist de mon seignor Yvain et s'en prist garde erraument et il osta le tronçon qu'il avoit el costé et s'estanchaé a sainier ; et de cele plaie demora mesire Yvain .XV. jors laiens.</p><p>In the Lanzarote, the episode is drastically shortened, since it presents only the following text:</p><p>… y andubieron tanto que llegaron ala Ramada ado estaua el cauallero y la donzella se lo mostro don yban quando le vio dixole señor cauallero yo vos Ruego que dedes su gauilan aesta donzella quele tomastes o sino enla vatalla sodes el gauilan no se lo dare yo dixo el Cauallero mas dela batalla presto so e luego se dexaron correr el vno contrael otro y el cauallero quebro su lança en don yban y don yban lo firio tan de Reçio quelo derribo en tierra todo atordido que no se pudo leuantar e don yban desçendio porle cortar la cabeza mas el le Rogo quele perdonase y que faria todo su mandado y don yban lo dexo y el Cauallero selo agradesçio mucho y dio luego el gauilan ala donzella y ella se fue muy alegre conel y don yban se fue ensu demanda</p><p>This example shows the need to define what counts as a collatable passage. A summary cannot really be aligned with its source: it calls for a different processing step rather than an attempt at alignment. More generally, this shows the need to establish solid typologies of textual relationships before making text processing decisions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Example of alignment table</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.3. Oscillation of Lanzarote between two groups</head><p>Lanzarote belongs to the second group, but sometimes presents readings from the first one, or even a mixed reading, as we can see in the following table (fragment ii-48): The first group, with Micha, is characterized by the mention of the porte and the presence of four freres, whereas the second one, that of Sommer, is characterized by the absence of the porte and the presence of only two freres. The text of the Lanzarote mentions a porte but only two freres.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.4. Example of innovation in Lanzarote</head><p>Lanzarote presents some specific innovations compared to the other witnesses, as in the following table (fragment ii-48):</p><table><row><cell>micha</cell><cell>lanzarote</cell></row><row><cell>puis voit delés les chevaliers une tombe, la plus riche</cell><cell>e el asi Catando vio dentro enlos arcos vna muy Rica tumba</cell></row><row><cell>qui onques fust fete par home, kar ele ert tote d'or fin a chieres pierres precioses qui molt valoient miels d'un grant roialme. Se la tumbe fu de grant bialté, nient ne monte la bialté envers la richece dont ele estoit et avec ce estoit ele la greignor que Lancelos eust onques veue:</cell><cell>qual nunca tal viera el ni ome del mundo que hera toda de oro fino e de piedras preçiosas que estaua toda labrada de diuersas maneras e de muy muchas cosas de figuras e de otras cosas que tal tumba de tal manera nunca viera ni ome del mundo que la non ouiese visto no podria asmar la fermosura que enella auia de tales cosas como enella estauan</cell></row><row><cell>si se merveille molt qui puet estre li princes</cell><cell>e don lançarote se marauillo e dixo entresi mesmo Quien podria ser el prinçipe</cell></row></table></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.5. Opposition between Lancellotto and Micha against BnF fr. 111</head><p>We can find a multitude of readings that oppose the group Lancellotto/Micha against BnF fr. 111 (fragment ii-61-1), such as the ones in the following table: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.6. Example of innovation in Lancellotto</head><p>In some passages, Lancellotto presents innovations (fragment ii-61-2): micha lancellotto et de ceu est il tos esbahis. Aprés regarde la pucele, si se merveille plus assés de sa bialté que del vaissel, e di ciò si maraviglia egli molto, si nè molto isbigotito.</p><p>Ma di cosa che veggia non si maraviglia egli niente inverso della trasgran biltà della damigella che tanto gli sembia esser bella ch'a pena l'osa riguardare, ché più gli è aviso che sua faccia risprenda che sole, e suoi occhi rilucenti e sua capellatura che gli sembia ad essere di fine oro; ve di persona si, come dice in suo cuore, kar onques mes ne vit il feme non ne vide mai niuna</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>F. DMDBScan</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :Figure 2 :</head><label>12</label><figDesc>Figure 1: Getting pairs of aligned fragments</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Number of tokens per segment (aligned units, left) and unit (small units automatically produced by the segmenter, right). BERT-based segmentation produces units with a more stable number of tokens and thus yields more semantically coherent aligned segments.</figDesc><graphic coords="11,103.46,221.06,237.53,158.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Results of the branch and bound algorithm on the variant tables produced by the two human annotators and the DBScan clustering. The parsimony values are 104 for the first annotator, 135 for the second, and 292 for the DBScan results.</figDesc><graphic coords="15,197.63,486.21,200.02,200.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>alla por aluergar e fallo ala puerta dos frailes et il torne cele part por herbergier. Et quant il i vint, si trueve a la porte IIII. des freres si torne cele part pour herbergier. Et quant il vint la si trouua .ii. freres</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>per miscredenza fallato qua in adietro se nos avons par mescreance foloié ça en arieres se nous auons foloye par mescreance … che trastutti cristiniavano, si confessò, udendo tutto el popolo, … que tuit se crestienoient, si reconuit oiant tot le pueple … que tous se c[re]stienneroit. si cogneut deuant tout le peuple Some variants are more important (fragment ii-48): lancellotto micha fr111 e quelli iscioglie immantanente sua cacciagione e la dona a suo compagnone per portarnela, et cil se destrosse maintenant de sa venoison, si la baille a son compaignon por porter, et cil baille sa uenoison a porter a son compaiguon.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Minimum distances to the MinPts nearest points, sorted in ascending order. Two different density levels are possible, the first breaking around a distance of 0.2, and the other around 1.2.</figDesc><graphic coords="28,197.63,118.14,200.02,200.02" type="bitmap" /></figure>
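<div xmlns="http://www.tei-c.org/ns/1.0"><p>The curve shown in Figure 5 can be approximated with a few lines of code. The sketch below sorts, for a set of points, the distance of each point to its MinPts-th nearest neighbour, and then reports the value after which the sharpest jump occurs. The function names and the toy distances are assumptions, and taking the largest jump is only a crude stand-in for the visual inspection of the plot described in the DMDBScan methodology.</p>

```python
from itertools import combinations


def k_distances(points, dist, k=1):
    """Sorted distances from each point to its k-th nearest neighbour."""
    result = []
    for p in points:
        to_others = sorted(dist[frozenset((p, q))] for q in points if q != p)
        result.append(to_others[k - 1])
    return sorted(result)


def value_before_sharpest_jump(sorted_values):
    """Value after which the sorted curve rises most abruptly."""
    jumps = [(b - a, a) for a, b in zip(sorted_values, sorted_values[1:])]
    return max(jumps)[1]


# Toy distances (assumed): two tight pairs of readings and one isolated reading.
points = ["p", "q", "r", "s", "t"]
dist = {frozenset(pair): 1.0 for pair in combinations(points, 2)}
dist[frozenset(("p", "q"))] = 0.1
dist[frozenset(("r", "s"))] = 0.15
dist[frozenset(("p", "t"))] = 0.9

curve = k_distances(points, dist, k=1)
print(curve, value_before_sharpest_jump(curve))
```

<p>On this toy data the sorted curve stays flat up to 0.15 and then jumps to 0.9, so a value of 𝜖 slightly above 0.15 would keep the two tight pairs together while leaving the isolated reading apart.</p></div>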
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>A correspondence between syntactic and semantic segments in fragment ii-48, on two following segments. The first one corresponds to the end of the preceding example. Each cell represents an alignment segment</figDesc><table><row><cell>Micha</cell><cell>Sommer</cell><cell>Lanzarote</cell><cell>Lancellotto</cell></row><row><cell>et il torne a destre</cell><cell>et torne a destre en</cell><cell>e dexo el camino</cell><cell>Ed e torna a destra</cell></row><row><cell>vers un chemin viés</cell><cell>vn petit chemin viez</cell><cell>e tomo a mano</cell><cell>inverso uno camino</cell></row><row><cell>et ancien;</cell><cell></cell><cell>derecha vn camino</cell><cell>vecchio e antico,</cell></row><row><cell></cell><cell></cell><cell>pequeño | e biejo</cell><cell></cell></row><row><cell></cell><cell></cell><cell>lleno de yerba</cell><cell></cell></row><row><cell>si ne demora gueres</cell><cell>Si ne demora gaires</cell><cell cols="2">e non andubo mucho si non dimora guari</cell></row></table></figure>
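<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Results of the segmentation models for French</figDesc><table><row><cell></cell><cell cols="2">Regexp</cell><cell cols="2">BERT-based</cell></row><row><cell>Accuracy</cell><cell cols="2">0.954</cell><cell cols="2">0.984</cell></row><row><cell></cell><cell>None</cell><cell>Delimiter</cell><cell>None</cell><cell>Delimiter</cell></row><row><cell>Precision</cell><cell>0.940</cell><cell>0.812</cell><cell>0.990</cell><cell>0.874</cell></row><row><cell>Recall</cell><cell>0.978</cell><cell>0.611</cell><cell>0.978</cell><cell>0.941</cell></row><row><cell>F1-score</cell><cell>0.959</cell><cell>0.670</cell><cell>0.984</cell><cell>0.906</cell></row></table></figure>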
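<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Italian model results</figDesc><table><row><cell></cell><cell cols="2">Regexp</cell><cell cols="2">BERT-based</cell></row><row><cell>Accuracy</cell><cell cols="2">0.961</cell><cell cols="2">0.983</cell></row><row><cell></cell><cell>None</cell><cell>Delimiter</cell><cell>None</cell><cell>Delimiter</cell></row><row><cell>Precision</cell><cell>0.937</cell><cell>0.711</cell><cell>0.981</cell><cell>0.827</cell></row><row><cell>Recall</cell><cell>0.971</cell><cell>0.523</cell><cell>0.976</cell><cell>0.866</cell></row><row><cell>F1-score</cell><cell>0.954</cell><cell>0.602</cell><cell>0.978</cell><cell>0.846</cell></row><row><cell cols="5">Table 4</cell></row><row><cell cols="5">Castilian model results</cell></row><row><cell></cell><cell cols="2">Regexp</cell><cell cols="2">BERT-based</cell></row><row><cell>Accuracy</cell><cell cols="2">0.951</cell><cell cols="2">0.981</cell></row><row><cell></cell><cell>None</cell><cell>Delimiter</cell><cell>None</cell><cell>Delimiter</cell></row><row><cell>Precision</cell><cell>0.942</cell><cell>0.688</cell><cell>0.981</cell><cell>0.869</cell></row><row><cell>Recall</cell><cell>0.962</cell><cell>0.584</cell><cell>0.982</cell><cell>0.863</cell></row><row><cell>F1-score</cell><cell>0.952</cell><cell>0.632</cell><cell>0.981</cell><cell>0.866</cell></row></table></figure>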
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 8</head><label>8</label><figDesc>Mean Adjusted Rand Index</figDesc><table><row><cell></cell><cell cols="3">Ann. 1 Ann. 2 DBScan</cell></row><row><cell>Annotator 1</cell><cell>1</cell><cell></cell><cell></cell></row><row><cell>Annotator 2</cell><cell>0.70</cell><cell>1</cell><cell></cell></row><row><cell>DBscan</cell><cell>0.04</cell><cell>0.04</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>Table 11 :</head><label>11</label><figDesc>Example of good quality alignment taken from ii-48 fragment (selected witnesses, no correction performed), with the BERT-based segmentation. Segments are separated with pipe "|" characters.</figDesc><table><row><cell>micha</cell><cell>fr751</cell><cell>inc</cell><cell>lanzarote</cell><cell>lancellotto</cell></row><row><cell>mais il voit l'eve noire et parfonde |et si perilluse</cell><cell>Mais il uoit leue si parfonde et si perilleuse</cell><cell>mes il uoit la riuiere parfonde et dangereuse a passer.</cell><cell>Mas el agua hera muy fonda e peligrosa e negra e bien cuidaua morir</cell><cell>ma e vede l'acqua nera e profonda e perigliosa</cell></row><row><cell>qu'il cuide bien noier,</cell><cell>que il cuide bien perir</cell><cell>et scait bien</cell><cell></cell><cell>che crede bene morire</cell></row><row><cell>s'il se met dedens;</cell><cell>se il se met dedans.</cell><cell>sil entre dedens|qu il se met en peril de mort.</cell><cell>si se y Metiese</cell><cell>s'elli|si mette di dentro;</cell></row><row><cell>et d'autre part il voit cele</cell><cell>et dautre part il uoit cele</cell><cell>Et daultre part il uoit celle</cell><cell>e dela otra parte veya la donçella</cell><cell>e d'altra parte e vede colei</cell></row><row><cell>qui si durement crie merci;</cell><cell>qui si docemet li prie [mer]ci.</cell><cell>qui si piteusement fui crie mercy.</cell><cell>que muy afincada mente le pedia merçed</cell><cell>che|si duramente gli grida mercé,</cell></row><row><cell>si l'em prent tels pitiés</cell><cell>Si len prent tes pitiez</cell><cell>Si luy en prent telle pitie</cell><cell>e ovo tal piedad della</cell><cell>si ne gli prende tale piatà</cell></row><row><cell>qu'il en laisse totes poors</cell><cell>quil en laisse totes paours.</cell><cell>quil en laisse toute paour</cell><cell>que le fiço todo el miedo perder</cell><cell>che ne lascia tutte paure</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_11"><head>E. Readings and groups of witnesses. E.1. Specific readings</head><label></label><figDesc>In some sections, every witness has its own reading (fragment ii-61-1):</figDesc><table><row><cell>micha</cell><cell>sommer</cell><cell>fr751</cell><cell>fr111</cell><cell>inc</cell><cell>lanzarote</cell><cell>lancellotto</cell></row><row><cell>Puis traist hors de sa char la piece de l'espee qui dedens estoit</cell><cell>puis traist hors de sa quisse le piece de lespee</cell><cell>Puis trait la piece de lespee qui en sa char estoit.</cell><cell>puis trait hors de sa char la piece de lespee</cell><cell>apres trait hors de sa cuisse lespee</cell><cell>e despues tiro la pieça dela espada desu pierna</cell><cell>Poscia trae fuori di sua carne la spada e l'apicca:</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_12"><head>E.2. Examples of readings specific to two main groups</head><label></label><figDesc>The two main groups can be established from common omissions, for example (fragment ii-48). We can also identify common variants that systematically oppose group 1 to group 2. In the second table (fragment ii-48), the two variants oppose shield to weapons, and castle to forest.</figDesc><table><row><cell>micha</cell><cell>sommer</cell><cell>fr751</cell><cell>fr111</cell><cell>inc</cell><cell>lanzarote</cell><cell>lancellotto</cell></row><row><cell>si le font desarmer, ou il volsist ou non, et il le fist a envis,</cell><cell></cell><cell>Si le font desarmer uossist ou non. et il le fait mout enuiz.</cell><cell>si le font desarmer uoulsist il ou non. et il le fist a peine.</cell><cell></cell><cell></cell><cell>si l fanno disarmare o volesse O non, ed elli il fece ad invidia,</cell></row><row><cell>micha</cell><cell>sommer</cell><cell>fr751</cell><cell>fr111</cell><cell>inc</cell><cell>lanzarote</cell><cell>lancellotto</cell></row><row><cell>mes il ne portoit mie tel escu</cell><cell>mais il ne portoit mie tels armes</cell><cell>mais il ne portoit mie tel escu</cell><cell>mais il ne portoit pas tel escu</cell><cell>mes il ne portoit mye telles armes</cell><cell>mas no traia el tales armas</cell><cell>ma e' non porta mica tale scudo</cell></row><row><cell>Al chastel, fet ele, de Floego</cell><cell>En forest fait la elle de floregas.</cell><cell>Au chastel fait ele de floego.</cell><cell>ou chastel de fleago fait ella</cell><cell>En la forest de florega:</cell><cell>enla floresta de donseglorega dixo la donçella</cell><cell>Al castello, diss'ella, di Fleego,</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">This article sets aside the identification of macroscopic displacements of large and medium-sized fragments (at paragraph level, for example) that can occur in complex traditions (see subsection 1.2).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The first part of the identifier corresponds to the volume, the second to the number of the first segment according to Micha<ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. Due to significant textual variation and the current capabilities of the aligner, we have subdivided the longest sections, ii-61 and iv-75, into two and six parts respectively, so that none of the segments exceeds 1,000 tokens. Despite this choice, the translations exhibit a significant number of omissions that are difficult to detect automatically. For example, Lanzarote contains the episode of the sparrowhawk (segment ii-61-2), but in an extremely shortened version of only a few sentences (Appendix C).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">A more recent preprint picks up on this work and compares Bertalign to a new architecture<ref type="bibr" target="#b13">[14]</ref>, but we have not yet examined how the algorithm works or how well it performs on our data.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The word-alignment phase, because it requires its own methodology, will not be dealt with in this article: we are only interested in the "macroalignment" phase, at the sentence or syntagm level.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The models which were used are: for the Castilian, the BETO model<ref type="bibr" target="#b7">[8]</ref>; for the French and for the Italian, the models from the MDZ Digital Library<ref type="bibr" target="#b20">[21]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">We retain a certain number of specific tokens, such as coordinating conjunctions between two clauses, relative pronouns, certain adverbs that begin sentences or clauses, etc., trying to segment appropriately without oversegmenting, for example, when two delimiter tokens appear consecutively (in the clause "et qui trop bien se defrendoient", we only retain the conjunction et).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">The evaluation was performed on test sets consisting of 4,300 words in Spanish, 1,340 words in French and 2,500 words in Italian.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">Due to the small size of the training corpus, it was decided to produce ad hoc models for each of the languages in the corpus. A multilingual model, with experiments on language metadata injection, will be produced and evaluated in an article dedicated to segmentation. These results provide a baseline for future in-depth studies.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">As with all methods using a pivot witness, alignment units in which the pivot omits text (omission or innovation of other branches) must be handled subsequently. Indeed, if the base witness omits a passage, the link between the other witnesses is unknown. These units must therefore be re-injected, which would require designing a second round of alignment that changes the base witness for the omitted parts. This work has yet to be carried out.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">For example, Lancellotto lacks the text for the beginning of the Agloval episode, which leads to a poor alignment because around 20 alignment units are absent from this witness.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Co-funded by the Agence Nationale de la Recherche under the Équipex Biblissima+ (ANR, 21-ESRE-0005).</p><p>Co-funded by the European Union (ERC, LostMA, 101117408). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.</p><p>The authors additionally wish to thank Florian Cafiero for his thoughts on this research.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendices</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Witnesses</head><p>• lancellotto: Lancellotto, Italian witness, edited by Luca Cadioli (cf. bibliography). Firenze, Biblioteca della Fondazione Ezio Franceschini, 1 (last quarter of the 14th century).</p><p>• lanzarote: Lanzarote, Castilian witness, edited by Antonio Contreras Martín and Harvey L. Sharrer (cf. bibliography). Madrid, Biblioteca Nacional de España, 9611 (16th century, copy of a manuscript dated 1414).</p><p>• sommer: edition of H. Oscar Sommer (cf. bibliography). Tomes IV and V: London, British Library, Additional 10293 (beginning of the 14th century).</p><p>• micha: edition of Alexandre Micha (cf. bibliography). Tome II: Cambridge, Corpus Christi College Library, 45 (2nd half of the 13th century). Tome IV: Oxford, Bodleian Library, Rawlinson D. 899 (14th century).</p><p>• fr111: BnF, français 111 (15th century).</p><p>• fr333: BnF, français 333 (first quarter of the 14th century).</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond</title>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00288</idno>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="597" to="610" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Qohelet Euporia: A Domain Specific Language to Annotate Multilingual Variant Readings</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bambaci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Boschetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Del Gratta</surname></persName>
		</author>
		<idno type="DOI">10.1109/cist.2018.8596332</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 5th International Congress on Information Science and Technology (CiSt)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="266" to="269" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Machine-Assisted Multilingual Alignment of the Old Church Slavonic Codex Suprasliensis</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Birnbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Eckhoff</surname></persName>
		</author>
		<ptr target="https://www.researchgate.net/profile/Nada-Sabec/publication/371120502_Toward_Less_Formal_Ways_of_Addressing_the_Other_in_Slovene/links/6473967b59d5ad5f9c803dac/Toward-Less-Formal-Ways-of-Addressing-the-Other-in-Slovene.pdf" />
		<editor>S. M. Dickey and M. R. Lauersdorf</editor>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="15" />
			<pubPlace>Bloomington, Indiana</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Arlima - Archives de littérature du Moyen Âge</title>
		<author>
			<persName><forename type="first">L</forename><surname>Brun</surname></persName>
		</author>
		<ptr target="https://www.arlima.net/" />
		<imprint>
			<date type="published" when="2005">2005</date>
			<pubPlace>Ottawa</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Lancellotto: versione italiana inedita del Lancelot en prose</title>
		<author>
			<persName><forename type="first">L</forename><surname>Cadioli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>Edizioni del Galluzzo</publisher>
			<pubPlace>Firenze</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Le Lancellotto italien</title>
		<author>
			<persName><forename type="first">L</forename><surname>Cadioli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">La matière arthurienne tardive en Europe, 1270-1530 = Late Arthurian Tradition in Europe</title>
				<meeting><address><addrLine>Rennes</addrLine></address></meeting>
		<imprint>
			<publisher>Presses universitaires de Rennes</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="501" to="510" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Stemmatology: An R Package for the Computer-Assisted Analysis of Textual Traditions</title>
		<author>
			<persName><forename type="first">F</forename><surname>Cafiero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Camps</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Corpus-Based Research in the Humanities CRH-2</title>
				<editor>
			<persName><forename type="first">Andrew</forename><forename type="middle">U</forename><surname>Frank</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Christine</forename><surname>Ivanovic</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Francesco</forename><surname>Mambrini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Marco</forename><surname>Passarotti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Caroline</forename><surname>Sporleder</surname></persName>
		</editor>
		<meeting>the Second Workshop on Corpus-Based Research in the Humanities CRH-2<address><addrLine>Vienna, Austria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-01">January 2018</date>
			<biblScope unit="page" from="25" to="26" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Spanish Pre-Trained BERT Model and Evaluation Data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cañete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chaperon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pérez</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2308.02976" />
	</analytic>
	<monogr>
		<title level="m">PML4DC at ICLR</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m">Lanzarote del Lago</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Contreras Martín</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Sharrer</surname></persName>
		</editor>
		<imprint>
			<publisher>Centro de estudios cervantinos</publisher>
			<pubPlace>Alcalá de Henares</pubPlace>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A dynamic method for discovering density varied clusters</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Elbatta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">M</forename><surname>Ashour</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Signal Processing, Image Processing and Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="123" to="134" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A density-based algorithm for discovering clusters in large spatial databases with noise</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-P</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KDD</title>
		<imprint>
			<biblScope unit="volume">96</biblScope>
			<biblScope unit="page" from="226" to="231" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Branch and bound algorithms to determine minimal evolutionary trees</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Hendy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Penny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Biosciences</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="277" to="290" />
			<date type="published" when="1982">1982</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<orgName>IRHT</orgName>
		</author>
		<ptr target="http://jonas.irht.cnrs.fr/" />
		<title level="m">Jonas: Répertoire des textes et des manuscrits médiévaux d&apos;oc et d&apos;oïl</title>
				<meeting><address><addrLine>Paris et Orléans</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Adaptative Bilingual Aligning Using Multilingual Sentence Embedding</title>
		<author>
			<persName><forename type="first">O</forename><surname>Kraif</surname></persName>
		</author>
		<idno>arXiv, 2024</idno>
		<ptr target="http://arxiv.org/abs/2403.11921" />
		<imprint/>
	</monogr>
	<note type="report_type">preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Bertalign: Improved Word Embedding-Based Sentence Alignment for Chinese-English Parallel Corpora of Literary Texts</title>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqac089</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="621" to="634" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Automated Alignment of Medieval Text Versions Based on Word Embeddings</title>
		<author>
			<persName><forename type="first">C</forename><surname>Meinecke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Wrisley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jänicke</surname></persName>
		</author>
		<idno type="DOI">10.31219/osf.io/tah3y</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">Open Science Framework</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Micha</surname></persName>
		</author>
		<title level="m">Lancelot: roman en prose du XIIIe siècle. Tome II</title>
				<meeting><address><addrLine>Genève</addrLine></address></meeting>
		<imprint>
			<publisher>Droz</publisher>
			<date type="published" when="1978">1978</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Micha</surname></persName>
		</author>
		<title level="m">Lancelot: roman en prose du XIIIe siècle. Tome IV</title>
				<meeting><address><addrLine>Genève</addrLine></address></meeting>
		<imprint>
			<publisher>Droz</publisher>
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">El banco de datos de la RAE: CREA y CORDE</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sánchez Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Domínguez Cintas</surname></persName>
		</author>
		<ptr target="https://dialnet.unirioja.es/servlet/articulo?codigo=2210249" />
	</analytic>
	<monogr>
		<title level="m">Per Abbat: boletín filológico de actualización académica y didáctica</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="137" to="148" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">La Tradizione Manoscritta Del &quot;Livre Du Gouvernement Des Roys et Des Princes&quot; Di Henri de Gauchy. Studio Filologico e Saggio Di Edizione</title>
		<author>
			<persName><forename type="first">G</forename><surname>Scala</surname></persName>
		</author>
		<idno type="DOI">10.5167/uzh-202620</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
		<respStmt>
			<orgName>University of Zurich</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Europeana BERT and ELECTRA models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schweter</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.4275044</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">O</forename><surname>Sommer</surname></persName>
		</author>
		<title level="m">The Vulgate Version of the Arthurian Romances</title>
				<meeting><address><addrLine>Washington</addrLine></address></meeting>
		<imprint>
			<publisher>Carnegie Institution of Washington</publisher>
			<date type="published" when="1911">1911</date>
			<biblScope unit="volume">4</biblScope>
		</imprint>
	</monogr>
	<note>Le Livre de Lancelot del Lac, part II</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">O</forename><surname>Sommer</surname></persName>
		</author>
		<title level="m">The Vulgate Version of the Arthurian Romances</title>
				<meeting><address><addrLine>Washington</addrLine></address></meeting>
		<imprint>
			<publisher>Carnegie Institution of Washington</publisher>
			<date type="published" when="1912">1912</date>
			<biblScope unit="volume">5</biblScope>
		</imprint>
	</monogr>
	<note>Le Livre de Lancelot del Lac, part III</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">The Complete Medieval Dreambook. A Multilingual, Alphabetical Somnia Danielis Collation</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Fischer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1978">1978</date>
		</imprint>
		<respStmt>
			<orgName>University of Michigan</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Bitext Alignment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</author>
		<ptr target="http://gen.lib.rus.ec/book/index.php?md5=7fb8c6d1d4e8924f79a03026e23b6517" />
	</analytic>
	<monogr>
		<title level="m">Synthesis Lectures on Human Language Technologies</title>
				<imprint>
			<publisher>Morgan &amp; Claypool Publishers</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>1st ed</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Displaying Textual and Translational Variants in a Hypertextual and Multilingual Edition of Shakespeare&apos;s Multi-text Plays</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tronch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Medieval and Renaissance Texts and Studies 502</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Estill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Jakacki</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ullyot</surname></persName>
		</editor>
		<meeting><address><addrLine>Toronto, Ontario</addrLine></address></meeting>
		<imprint>
			<publisher>Iter Press</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="92" to="116" />
		</imprint>
	</monogr>
	<note>Early Modern Studies after the Digital Turn</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Interactive Optimization of Embedding-Based Text Similarity Calculations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Witschard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Jusufi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kerren</surname></persName>
		</author>
		<idno type="DOI">10.1177/14738716221114372</idno>
	</analytic>
	<monogr>
		<title level="j">Information Visualization</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Transformers: State-of-the-Art Natural Language Processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Funtowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">V</forename><surname>Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Plu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gugger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Drame</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lhoest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.emnlp-demos.6" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations<address><addrLine>Online</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="38" to="45" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
