The birth of French orthography. A computational analysis of French spelling systems in diachrony ⋆

Simon Gabay (simon.gabay@unige.ch), Thibault Clérice (thibault.clerice@inria.fr)
Affiliations: Inria Centre de Recherche de Paris; Université de Genève
ISSN: 1613-0073
Keywords: Computational linguistics, History of orthography, Information extraction, Corpus building

The 17th c. is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in French linguistics, for two reasons. On the one hand, spelling is made up of micro-changes, which require a quantitative approach; on the other hand, no corpus is available, because editors have intervened in almost all the texts already published. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR and a version automatically aligned with contemporary French spelling, we extract the variant zones, categorise these variants, and observe their frequency to study (ortho)graphic change during the 17th century.

Introduction

The grapho-phonetic aspects of French during the 17th c. paradoxically remain very poorly known, despite the importance of the graphematic question in this period, which saw the appearance of French orthography 1. Rather than the actual practice of scriptors, it is the depth of theoretical debates on spelling that has so far attracted most research (e.g. [10] or [7]), and the notebooks of Mezeray [1] or the Remarques of Vaugelas [56,57] remain among the main sources used, rather than statistical surveys of vast corpora.

If the various dialects and other scriptae populating Old and Middle French have been abundantly described (e.g. in [15]), just like the "orthographie" of the Renaissance (to quote the term used by Baddeley [3]), the slow imposition of an orthographic norm throughout modern times, although a major phenomenon in the history of a language as prescriptive as French, remains a blind spot in diachronic linguistics. How has the French that we know today supplanted its various modern variations? One of the main technical challenges for carrying out such a study lies in the need for large amounts of data, in order to guarantee quantitatively the reliability of the results. Unfortunately, such corpora of classical French do not exist, for two reasons. On the one hand, as Cl. Vachon [55, p. 32, n. 31] bitterly experienced, text editors got into the habit of standardising the language of that era [22,16], which makes existing editions particularly complicated, if not impossible, to use for graphematic studies. On the other hand, the few corpora that have been created, like that of Cl. Vachon, but also others like that of the Réseau Corpus Français Préclassique et Classique (RCFC) [2], are not, or not in full, available to researchers. Given the ever-increasing quantities of data necessary for computational studies, it is however likely that these two corpora, even if freely accessible, would in any case remain insufficient for the most recent approaches proposed in NLP.

This paper proposes to return to the history of the French vêtement graphique ("graphic clothing") in a computational way. We introduce a two-step approach: first, a unique corpus creation pipeline meticulously extracts spelling information from digital facsimiles. This pipeline includes a layout analysis model to distinguish text from paratext on the page, an OCR model that retains the historical character ‹ſ›, pivotal to written French, and a linguistic normaliser that "translates" historical French into its contemporary counterpart at the sentence level. In the second step, we analyse the created corpus using a comparison algorithm that matches the extracted historical text with its modern equivalent at the character level. This enables us to pinpoint significant variations, categorise these differences, and uncover detailed trends throughout the 17th century. This methodological framework not only enhances our understanding of historical French orthography, but also proposes a new approach for computational linguistic studies of spelling variation.

State of the art

Corpus building from OCR has long been a task in digital humanities and corpus linguistics. Initially deemed unsuitable for historical sources in 1993 [44], OCR gained credibility in the late 1990s for corpus building, including XML TEI formalisation in commercial projects such as the Patrologia Latina Database, and for Ancient Greek scripts in the 2010s [49]. Most projects using TEI, such as the First1KGreek project, relied on manual formalisation of the text's logical structure [40], as manual work was considered essential for accuracy. The advent of user-friendly OCR and HTR technologies has spurred interest in automatic document formalisation (ADF), primarily focused on facsimile formalisation [52] and noisy text removal with tools based on vocabularies such as SegmOnto [25], which standardises the identification of paratextual zones (running titles, footnotes, etc.). Few projects, however, have utilised font, geometric, and textual features to reconstruct or emulate the original text structure from born-digital PDFs or OCR outputs. PaperXML [51] demonstrated such a transformation but was limited to the ACL Anthology structure. Grobid [50] and Grobid Dictionaries [35,34] employed geometric, font, and textual features to produce XML TEI output, though they were specific to scientific papers and dictionaries. In 2022, visual features outperformed linguistic ones in document formalisation, with YOLO models using the SegmOnto controlled vocabulary surpassing LayoutLM models in multilingual settings [41]. Recently, research has started on OCR output formalisation for corpus building with a controlled vocabulary and a training dataset for models [33,45]. Lastly, the Layout Analysis Dataset with SegmOnto (LADaS) [13] allowed a much finer granularity in the analysis, and a significant improvement of the entire pipeline for the automatic creation of XML-TEI files, which goes beyond the facsimile approach and comes closer to reproducing the logical structure of the text.

Linguistic Normalisation (LN) has a long history, dating back to the 1980s [17], but it developed as a task derived from Machine Translation (MT) at the beginning of the 2010s, usually to improve downstream tasks in the pipeline such as linguistic annotation [54]. LN shares important similarities with MT, and therefore relies on the same methods, but with a slightly different objective: to "translate" a source into another state of the language, usually more recent (16th c. German → contemporary German), rather than into another language (Italian → German). Resources existed first for Slovene, German, English, Hungarian, Spanish, Swedish and Portuguese [8], but several studies have recently improved both resources [21] and techniques for historical French, first comparing rule-based, statistical and neural methods [23], and then alignment-based and neural MT approaches [6].

Computational scriptology is based on the notion of scripta, coined by Remacle [48] and widely used in Romanistics to distinguish a spoken language (the dialect) from a written language (the scripta). The first studies on dialectometry date back to the early 1970s with the pioneering work of Jean Séguy, who invented the term dialectométrie [53], on the distance between dialects in vast corpora [48]. Since then, two main schools, based in Salzburg [30] and Groningen [43], have advanced research on the topic, but relying mainly on geographical data to localise dialects. In parallel to this research, Cl. Vachon has changed the approach, switching to corpus-based research and using historical data to study spelling semi-automatically [55], and more recently, J.-B. Camps has shifted the method, using unsupervised stylometry to categorise medieval scriptae [9]. Regarding modern French, alternative studies have proposed alignment-based approaches that compare the historical source with an automatically normalised version to detect the evolution of spellings [24] or to categorise documents [28].

Corpus building

Data

For practical reasons, a first corpus of limited size (c. 600 texts) spanning the 17th c. was produced with our pipeline. The data comes from the Gallica digital library and contains only French-language documents. For our experiment, we have selected only plays, which offer medium-sized documents (compared to novels, potentially much longer) and linguistically homogeneous data (spelling can be influenced by the genre: legal documents, for instance, tend to use more "archaic" traits and may involve Latin phrases).

Method

Our pipeline allows us to extract data, enrich it and store it in a standard format (cf. fig. 1). Firstly, we apply a layout analysis model specialised in theatrical data and trained for the occasion, then we use an OCR model prepared for this study, which preserves the long s (‹ſ›). Based on the layout analysis, we convert ALTO files to TEI files. Only textual data that contains the text of the work (paragraph, speech, verse, etc.) and is not linked to the structure of the book (running title, page number, quire marks, etc.) is extracted and normalised automatically, before being reintroduced into the TEI file.

Table 1: Training and evaluation data for Layout Analysis across the datasets. Each image represents a single document in the dataset.

Layout analysis. Based on the results of Najem-Meyer and Romanello [41] and on the initial evaluation, with YALTAi [12], of a YOLO region segmenter against Kraken's [36], we proposed to evaluate the ability of YOLOv8 [47] to detect regions in our 17th c. print corpus.

For this purpose, we annotated one random image from each digitised version of our corpus, which could include empty pages (e.g., bookbinding, cover) and full pages. This resulted in a corpus of 620 images for training, evaluation, and testing. Our final corpus comprises 32 null pages (without annotations) and a variety of annotations, with a majority of speech-related tags (MainZone:SP, MainZone:SP#Continued), paratextual-related objects (e.g., NumberingZone, RunningTitleZone), a smaller number of logical structuring features such as scene titles (MainZone:Head) and cast lists (MainZone:Entry), as well as a few paragraphs and poetic excerpts, mainly found in incipits or prefaces of the books (MainZone:P), as seen in tab. 1.

Optical character recognition. Since YOLO is well integrated within YALTAi, which in turn works seamlessly with Kraken, we decided to use the latter to train a new OCR model that includes the long s (‹ſ›). Kraken, unlike other OCR systems, avoids the integration of a strong language model, which, for our purpose, allows more variation to be kept. This new model, derived from CATMuS Print [26], uses three datasets for fine-tuning [19,20,29] and one evaluation dataset [27] (cf. tab. 2). We evaluate on a test set that includes data spanning three centuries (from the 16th to the 18th) and comprises one page from 10 different documents for each period.

TEI Document production. Document formalisation follows a logical approach based on the ALTO output produced by Kraken and YALTAi, rather than a neural one. Each region is processed in reading order, with regions not matching MainZone being ignored, except for the "default" region, which handles orphan lines. The default region is placed into a <fw> ("forme work") tag, which is typically excluded from our text export processes. Regions marked as #Continued are logically merged with previous ones. Each line is preceded by a TEI <lb/> (line beginning) tag to facilitate back-to-document correction. Hyphenation is resolved by removing the hyphen but keeping the <lb/> tag in its place 2. While machine learning is employed for initial region detection, the formalisation process itself does not involve any learned behaviour. Metadata are systematically integrated in the <teiHeader>, using information automatically retrieved from the catalogue of the French National Library via the ark ID.
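The line handling described above can be illustrated with a minimal sketch. It assumes that the lines of one region arrive as plain strings in reading order; the function name `lines_to_tei` is hypothetical and is not taken from the project's code.

```python
from xml.sax.saxutils import escape


def lines_to_tei(lines):
    """Join OCR lines into one TEI string, with one <lb/> at each line start."""
    pieces = []
    glue = ""  # separator written before the next <lb/>
    for line in lines:
        text = line.strip()
        pieces.append(glue + "<lb/>")
        if text.endswith("-"):
            # Hyphenated word: drop the hyphen and put no space before the
            # next <lb/>, so the two halves of the word stay contiguous.
            pieces.append(escape(text[:-1]))
            glue = ""
        else:
            pieces.append(escape(text))
            glue = " "
    return "".join(pieces)


print(lines_to_tei(["Promettez-moy donc, Sei-", "gneur Geronimo, de me", "parler."]))
# <lb/>Promettez-moy donc, Sei<lb/>gneur Geronimo, de me <lb/>parler.
```

Such a naive rule also joins genuine compounds broken at the hyphen (the lui-mesme case mentioned in the footnote), which is why these cases need to be monitored.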

Linguistic normalisation. All documents are processed via a previously trained normaliser 3. Only text contained in <p> and <sp> ("speech") elements is kept for normalisation, because a specific spelling variation occurring in the running title, for instance, would be repeated every two pages and could artificially alter the result of the scriptometric analysis. The text is split into sentences (ending with a full stop, an exclamation mark or a question mark) or subsentences (ending with a colon or a semicolon), all stored in a <seg> ("arbitrary segment") element, with the source text in <orig> ("original form") and the automatically normalised text in <reg> ("regularization"). The normalised version is evaluated against a dictionary of modern French to control the quality of the final product.
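The segmentation and encoding step can be sketched as follows. This is only an illustration under stated assumptions: the splitting regex is a guess at the behaviour described above, and `toy_normalise` is a hypothetical stand-in for the neural normaliser (it only maps long s to round s).

```python
import re
import xml.etree.ElementTree as ET

# Split after sentence-final (. ! ?) or subsentence-final (: ;) punctuation.
SPLIT_RE = re.compile(r"(?<=[.!?:;])\s+")


def split_segments(text):
    """Return the sentences and subsentences of a text, punctuation included."""
    return [s for s in SPLIT_RE.split(text.strip()) if s]


def toy_normalise(segment):
    """Hypothetical placeholder for the normaliser: long s -> round s only."""
    return segment.replace("ſ", "s")


def build_ab(text, normalise=toy_normalise):
    """Wrap each segment in <seg> with its <orig> and <reg> versions."""
    ab = ET.Element("ab")
    for segment in split_segments(text):
        seg = ET.SubElement(ab, "seg")
        ET.SubElement(seg, "orig").text = segment
        ET.SubElement(seg, "reg").text = normalise(segment)
    return ab


ab = build_ab("SGANARELLE. Ie vous le promets; parlez avec franchiſe.")
print(ET.tostring(ab, encoding="unicode"))
```

Keeping <orig> and <reg> side by side in the same <seg> is what later makes the word-level and character-level comparison possible without any re-alignment of sentences.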

Experimental Setup and Evaluation

Layout analysis. We evaluate two possible setups: both fine-tune the original YOLOv8L model with an input image size of 960 pixels (higher than the default). One setup uses only the dataset produced in the context of this paper, while the other merges this dataset with the larger LADaS dataset (5,000 images). We train both setups for 100 epochs with otherwise default parameters. Since our study focuses exclusively on the MainZone, which contains the primary text and excludes all paratextual elements (such as decorations, page numbers, and running titles), we have concentrated our evaluation on this specific zone. Overall, when considering all classes, we found that integrating our data with the LADaS corpus yields improved results (0.768 vs. 0.8). However, for the most critical classes (Sp and Sp-continued), the model trained exclusively with theatrical data produces slightly better outcomes. As previously mentioned, these are the classes essential for our study.

Text recognition. To fine-tune and adapt the CATMuS Print OCR model to the allographic variation of round s/long s, we modified the classifier codec (--resize new mode) and used a standard learning rate of 0.0001, along with a batch size of 32. This ensures that the model is fine-tuned to the specific typographic variations, while the formalisation process itself still does not rely on any learned behaviour. We compare this approach to a model with the same architecture trained from scratch, without fine-tuning (cf. tab. 4), which reveals the superiority of fine-tuning. Most errors stem from poor segmentation of the text (cf. tab. 5): a space that should be present is missing from the prediction, a classic error for historical prints. The prediction errors regarding the two types of apostrophes (curved or straight) are of little concern, because they do not affect the result from a linguistic point of view and are due to poor data preparation that is easily correctable. The confusion between the round s and the long s is likely attributable to the fine-tuning process and the absence of the long s in the base model.

Linguistic normalisation

To evaluate the results of the normalisation, we compare the prediction of the normaliser with a dictionary of contemporary French to obtain a Word Accuracy (WAcc). Results are satisfactory (cf. fig. 3), with a median above 90%. Texts with a WAcc under 80% are removed to avoid using unreliable data.
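A minimal sketch of this dictionary-based check is given below. The tokenisation regex, the toy lexicon and the function names are assumptions made for illustration; only the 80% threshold comes from the text.

```python
import re


def word_accuracy(normalised_text, lexicon):
    """Share of word tokens of a normalised text that are found in the lexicon."""
    words = re.findall(r"[^\W\d_]+", normalised_text.lower())
    if not words:
        return 0.0
    known = sum(1 for word in words if word in lexicon)
    return known / len(words)


def filter_reliable(texts, lexicon, threshold=0.80):
    """Score every text and keep the names of those at or above the threshold."""
    scores = {name: word_accuracy(text, lexicon) for name, text in texts.items()}
    kept = {name for name, score in scores.items() if score >= threshold}
    return scores, kept


# Toy lexicon and toy texts, not taken from the corpus.
lexicon = {"je", "vous", "le", "promets", "de", "me", "parler"}
texts = {
    "a": "Je vous le promets.",
    "b": "Ie vous le promets de me parler avecq franchize",
}
scores, kept = filter_reliable(texts, lexicon)
print(scores, kept)
```

A lexicon lookup of this kind penalises proper nouns and rare words as well as normalisation errors, so the score is a conservative proxy rather than a true accuracy against a gold standard.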

Result dataset

The final dataset is made up of around 80,000 pages for 620 documents. While the number of documents is uneven over the years (cf. fig. 5a), the accumulated tokens progress evenly (cf. fig. 5b). An example of our TEI encoding is presented in fig. 4.

Evaluation of spelling variation

Method

We use the ABA [46] tool to precisely identify the portions of words which differ between the original version and the normalised version, and group similar differences, for example having the same historical-linguistic origin, or the same type of operations in terms of addition, deletion or modification of characters. Each <orig> and <reg> of the corpus is split into words, the punctuation is removed, and then the original and normalised versions are aligned at the word level using the Needleman-Wunsch [42] algorithm, using the Levenshtein distance [39] between each pair of words in the same <seg> in the original and normalised version 4 .
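The word-level step can be sketched as a global alignment in which the cost of pairing two words is derived from their Levenshtein distance. This is a cost-minimising variant written for illustration: the length-normalised substitution cost and the gap cost of 1.0 are assumptions, not the settings of the ABA tool.

```python
def levenshtein(a, b):
    """Classic edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def word_cost(a, b):
    """Assumed substitution cost: edit distance scaled to the range [0, 1]."""
    return levenshtein(a, b) / max(len(a), len(b))


def align_words(orig, reg, gap=1.0):
    """Needleman-Wunsch style alignment of two word lists (minimal total cost)."""
    n, m = len(orig), len(reg)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + word_cost(orig[i - 1], reg[j - 1]),
                cost[i - 1][j] + gap,
                cost[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner; None marks an unmatched word.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + word_cost(orig[i - 1], reg[j - 1]):
            pairs.append((orig[i - 1], reg[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap):
            pairs.append((orig[i - 1], None))
            i -= 1
        else:
            pairs.append((None, reg[j - 1]))
            j -= 1
    return pairs[::-1]


print(align_words("Ie vous le promets".split(), "Je vous le promets".split()))
```

Aligning whole words first keeps the later character-level comparison local: a missing or extra word in the normalised version then shows up as an unmatched word instead of shifting every following pair.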

Secondly, for each of the aligned word pairs, the original version and the normalised version are aligned at the character level, still using the Needleman-Wunsch algorithm, but using a specific substitution matrix to allow not only identical letters to be aligned, but also letters considered close in (pre)classical French and contemporary French (presence/absence of diacritic, ligatures…). For example, while identical letters benefit from a substitution score of 4, letters differing only in accent or cedilla benefit from a score of 2, as do ‹ſ› and ‹s› or ‹s› and ‹ß› for example. Other pairs of letters benefit from a score of 1, such as ‹u› and ‹v›, ‹s› and ‹z› or even ‹n› and ‹m›. Conversely, a score of -1 is assigned to pairs of distinct letters not subject to such exceptions, as well as to the deletion or insertion of a character.

|   | A | p | o | ſ | t | r | e |
|---|---|---|---|---|---|---|---|
| A | ↘ 4 | → 3 | → 2 | → 1 | → 0 | → -1 | → -2 |
| p | ↓ 3 | ↘ 8 | → 7 | → 6 | → 5 | → 4 | → 3 |
| ô | ↓ 2 | ↓ 7 | ↘ 10 | → 9 | → 8 | → 7 | → 6 |
| t | ↓ 1 | ↓ 6 | ↓ 9 | ↓ 8 | ↘ 13 | → 12 | → 11 |
| r | ↓ 0 | ↓ 5 | ↓ 8 | ↓ 7 | ↓ 12 | ↘ 17 | → 16 |
| e | ↓ -1 | ↓ 4 | ↓ 7 | ↓ 6 | ↓ 11 | ↓ 16 | ↘ 21 |

The arrows indicate the previous box on the optimal path to calculate the similarity between two prefixes, one from the word in the first row, the other from the word in the first column. On this optimal path, green indicates equality, red indicates substitution, and blue indicates deletion.

This execution of the Needleman-Wunsch algorithm to obtain character-level alignment is illustrated in the matrix in tab. 6, where each number represents the similarity score of the best alignment found between the prefixes of ‹Apoſtre› and ‹Apôtre› up to this box. It is preceded by an arrow indicating which box to come from to obtain this best alignment. For example, to obtain the best alignment between ‹Apoſ› and ‹Apô›, we must consider the best alignment between ‹Apo› and ‹Apô› (which has a score of 10), then make an insertion of ‹ſ›, which has a score of -1, giving a total score of 9. If we had preferred to first consider the best alignment between ‹Apoſ› and ‹Ap›, which has a score of 6, and then delete the ‹ô›, which has a score of -1, we would have obtained an alignment with a score of 5, therefore lower than optimal. In case of insertion or deletion during this alignment step, we use the ¤ character in order to obtain two words of the same length in the original and the normalised version. Thus, at the end of this second alignment step, the word Apoſtre in the original version is matched with Apô¤tre in the normalised version to obtain a character-by-character alignment.
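The character-level step can be reproduced with a short sketch using the scores quoted above (4, 2, 1 and -1). The lists of "close" letter pairs only contain the examples named in the text, and the function names are hypothetical; the actual substitution matrix of the tool is larger.

```python
import unicodedata

GAP = "¤"
CLOSE_2 = {frozenset("ſs"), frozenset("sß")}  # scored 2, like accent or cedilla differences
CLOSE_1 = {frozenset("uv"), frozenset("sz"), frozenset("nm")}  # scored 1


def strip_marks(ch):
    """Remove combining marks (accents, cedilla) from a character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def score(a, b):
    """Substitution score between an original and a normalised character."""
    if a == b:
        return 4
    if strip_marks(a) == strip_marks(b) or frozenset((a, b)) in CLOSE_2:
        return 2
    if frozenset((a, b)) in CLOSE_1:
        return 1
    return -1


def align_chars(orig, reg, gap=-1):
    """Needleman-Wunsch alignment; returns both padded words and the score."""
    n, m = len(orig), len(reg)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + score(orig[i - 1], reg[j - 1]),
                s[i - 1][j] + gap,
                s[i][j - 1] + gap,
            )
    out_o, out_r = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + score(orig[i - 1], reg[j - 1]):
            out_o.append(orig[i - 1])
            out_r.append(reg[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or s[i][j] == s[i - 1][j] + gap):
            out_o.append(orig[i - 1])  # character only in the original
            out_r.append(GAP)
            i -= 1
        else:
            out_o.append(GAP)  # character only in the normalised version
            out_r.append(reg[j - 1])
            j -= 1
    return "".join(reversed(out_o)), "".join(reversed(out_r)), s[n][m]


print(align_chars("Apoſtre", "Apôtre"))
# ('Apoſtre', 'Apô¤tre', 21)
```

In this sketch the original word indexes the rows of the score matrix, whereas tab. 6 places it in the first row; the final score and the resulting alignment are the same.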

Finally, for each word in the corpus, its original and normalised versions are analysed, character by character, to detect, in the case of different characters at the same position, the normalisation rule that applies, or to signal that no existing rule was identified when appropriate. 72 rules were defined based on the bibliography and the differences observed in the gold FreEM norm parallel corpus [21]. For example, the rule Ramist letter is detected if an ‹i›, a ‹j›, a ‹u› or a ‹v› in the original word corresponds respectively to a ‹j›, an ‹i›, a ‹v› or a ‹u› in the normalised version.
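Rule detection over an aligned pair can be sketched as a position-by-position comparison. Only the Ramist letter rule is described in the text; the "long s" rule name and the fallback label below are illustrative assumptions, and the 72 actual rules are not reproduced here.

```python
RAMIST_PAIRS = {("i", "j"), ("j", "i"), ("u", "v"), ("v", "u")}


def detect_rules(orig_aligned, reg_aligned):
    """Return (position, rule name) for every position where the words differ."""
    if len(orig_aligned) != len(reg_aligned):
        raise ValueError("aligned words must have the same length")
    found = []
    for pos, (o, r) in enumerate(zip(orig_aligned, reg_aligned)):
        if o == r:
            continue
        pair = (o.lower(), r.lower())
        if pair in RAMIST_PAIRS:
            found.append((pos, "Ramist letter"))
        elif pair == ("ſ", "s"):
            found.append((pos, "long s"))  # hypothetical rule name
        else:
            found.append((pos, "no rule identified"))
    return found


print(detect_rules("Ie", "Je"))                # [(0, 'Ramist letter')]
print(detect_rules("franchiſe", "franchise"))  # [(7, 'long s')]
print(detect_rules("moy", "moi"))              # [(2, 'no rule identified')]
```

Reporting unmatched differences explicitly, rather than dropping them, is what makes it possible to spot both missing rules and errors of the normaliser.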

Results

Figure 6: Disappearance of ‹gn›.

Based on the alignments obtained using the Needleman-Wunsch algorithm and the detections of the 72 rules mentioned earlier, our analysis reveals four distinctive patterns of historical spelling changes. The principle underlying this analysis is straightforward: if a normalisation rule is detected less frequently, it indicates that the historical spelling it targets is becoming less prevalent in the corpus. To examine its evolution throughout the century, we normalise the total number of rule applications to its percentage within each text. For instance, the etymological spelling ‹gn›, found in the form cognoitre (<lat. cognoscere), is less and less often replaced by ‹nn› (today connaître, eng. "to know"), signifying the slow disappearance of this spelling (cf. fig. 6).

Pattern A: constant rate. Using ABA, it is possible to detect more complex traits of historical graphic systems than the specific use of a single letter (e.g. ‹u› vs ‹v› as a vowel) or a group of letters (‹gn› vs ‹nn›), such as the presence of a diacritical letter to change the sound-value of the letter to which it is added (e.g. vowel + ‹s›). In historical French, the phoneme [e] is thus regularly noted with the grapheme ‹es› where today we use ‹é› (estat vs état, eng. "state"), and the phoneme [ä] is noted ‹as› where we now find ‹â› (pasturage vs pâturage, eng. "pasture"). While counting the occurrences of ‹v› followed by a consonant (‹vne› = [yn]) is enough to identify the historical use of ‹v›, it is impossible to count the occurrences of ‹es› to measure the presence of a diacritical s (‹esponge› → ‹éponge›, eng. "sponge", but ‹espagnol› → ‹espagnol› and not ‹épagnol›, eng. "spanish"): the transition from a complex grapheme (such as a digraph) to a simple grapheme requires an alignment at the character level of the original text and its normalised version, and then the deduction of the spelling change from the difference between the two. In our corpus, we detect a clear decrease in the use of complex graphemes with a diacritical s, whether the latter is combined with ‹e› (cf. fig. 7a) or with ‹a› (cf. fig. 7b). Interestingly, the propagation of these two new spellings (vowel+accent) occurs at a very similar speed (cf. fig. 8a), recalling Kroch's constant rate hypothesis (cf. fig. 8b) 5, of which researchers have already found traces in syntactic [59] and phonological [18] change.

Pattern B: abrupt change. On the basis of such observations, it is however possible to go further and date the moment when a break occurs in the scribal practice, i.e. when the spelling changes. To do so, we can use binary segmentation (BS) [58,4], an algorithm using a forward stepwise method, to detect change points. This method has already been used in diachronic linguistics to study the sudden introduction of new lexical items [37]. One of the main discoveries of our study is the extremely abrupt nature of certain changes, which take place at very high speed, such as the disappearance between 1668 and 1672 of the spellings targeted by the Ramist letter rule (cf. fig. 9a), in line with the distinction proposed by Christophe Plantin in the 16th c. [11] and defended by Pierre Corneille in his foreword au lecteur of 1663 [14]. A similar phenomenon, although slightly less abrupt, exists for the disappearance of the etymological ‹c› followed by ‹t› (e.g. ‹faict›<factum, today fait, eng. "fact") at the end of the 1630s (cf. fig. 9b).

Pattern C: correlation. The type case available to printers has been proposed as one of the parameters of spelling change [7, p. 92]. It is indeed faster to compose the word eſtoit with the ligature (e+ſt+o+i+t=4 characters) than without (e+ſ|s+t+o+i+t=5 characters). One could argue that switching to the accented letter also requires only four characters (é+t+o+i+t=4 characters), but while ligatures are plentiful in the printer's type case, accented characters are less so. Our working hypothesis is as follows: as ligatures are largely composed of a long s (‹ſ›), we should obtain a correlation between the use of this s (cf. fig. 10a) and the acute accent (cf. fig. 10b). We evaluate the correlation between the evolution of the two phenomena over time, and obtain a Pearson product-moment correlation coefficient of 0.365 with a p-value of 4.88e-20, which indicates a moderate but statistically significant correlation (cf. fig. 11).

Pattern D: innovation. Finally, it is important to note that, in this slow movement of standardisation that we are describing, innovations also appear. These innovations largely concern diacritics, some of which are exploding in number, like the diaeresis (cf. fig. 12a): scriptors tend to add them more and more on one of the two hiatus vowels, especially with the sequence ‹ue› (louër or loüer, today louer, eng. "to rent/to praise"). We also note a great hesitation regarding the notation of nasal vowels (cf. fig. 12b), especially [ã], for which we can use ‹en|m› or ‹an|m›, such as aventure vs avanture (today aventure, eng. "adventure").
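The change-point dating used for Pattern B can be illustrated with a minimal binary segmentation sketch. It assumes a least-squares criterion (a shift in the mean) and a user-chosen `min_gain` threshold; both are illustrative choices, and the yearly rates below are toy values, not corpus data.

```python
def sse(values):
    """Sum of squared deviations from the mean of a segment."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(series):
    """Index and gain of the split that most reduces the squared error."""
    total = sse(series)
    best_k, best_gain = None, 0.0
    for k in range(1, len(series)):
        gain = total - (sse(series[:k]) + sse(series[k:]))
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


def binary_segmentation(series, min_gain, offset=0):
    """Recursively split the series while a split improves the fit enough."""
    if len(series) < 2:
        return []
    k, gain = best_split(series)
    if k is None or gain < min_gain:
        return []
    left = binary_segmentation(series[:k], min_gain, offset)
    right = binary_segmentation(series[k:], min_gain, offset + k)
    return left + [offset + k] + right


# Toy yearly rates (in %) of a rule: stable, then an abrupt drop.
years = list(range(1660, 1680))
rates = [90] * 10 + [10] * 10
breaks = binary_segmentation(rates, min_gain=1000.0)
print([years[k] for k in breaks])  # [1670]
```

The forward stepwise nature of the method is visible in the recursion: the strongest break is located first, and each resulting segment is then searched independently for further breaks.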

Conclusion and further work

The spelling of the 17th c. changes throughout the century, and at the beginning of the 18th century the standardisation process is very advanced (etymological letters, use of diacritical letters, historical use of ‹u› and ‹i›, etc.), as studies on other languages, such as Polish [31] or English [5], have been able to demonstrate. We still observe, however, a certain instability, which mostly concerns minor hesitations (notation of nasal vowels, hiatus vowels, etc.): although the standardisation process is advanced, it is not yet finished.

Table 7: Main spelling changes and their dating using change-point detection.

Among all the changes in spelling, all those observed seem to follow the traditional shape of the S-curve, and no "anomalies" have yet been found, as has been the case elsewhere [31]: on the contrary, we even think we have found new evidence supporting Kroch's constant rate hypothesis.

The idea of a slow change, which spreads over a long period [5], seems to be confirmed by our analyses (cf. tab. 7). However, the velocity of change varies greatly from one phenomenon to another, with sometimes slow shifts over decades, or sometimes abrupt ruptures whose cause is not entirely clear, and which would be interesting to discover.

As for the reasons for the change, a lot of work still needs to be done, particularly in trying to find features that could predict the change [32]. One of them, the identity of the printers, would be interesting to evaluate; unfortunately, the data is not always available, particularly for the 18th century, which will pose a problem for the future of this study. Nevertheless, some indications suggest that it would be important to review the hypothesis that sees printing as a vector of change [5]: the limitations imposed by the type case of printers could for instance be a hindrance to change.

A more precise modelling of these changes is therefore on the agenda for our future research, whether it concerns the identification of possible reasons for these changes, their more precise dating (in particular by integrating confidence intervals), or the addition of new data for the 18th century. The improvement of all data extraction and enrichment tools has already begun, and should thus allow the creation of an even larger and more precise corpus.

Figure 1: Data production pipeline.

Figure 2: Three page examples with zone objects.

Figure 3: Word error rate for the corpus.

Figure 4: Example of TEI encoding with normalisation.

Figure 5: Description of the OCRised corpus.

(a) Substitution of ‹es› by ‹é›. (b) Substitution of ‹as› by ‹â›.

Figure 7: Disappearance of ‹s› as a diacritical letter.

(a) Accumulation of occurrences of different spellings: two similar (es→é, as→â) and one different (ct→t). Data are scaled to base 100 to be comparable. (b) Theoretical progression of two similar variants over time, which start at different times, but progress at the same speed, according to the constant rate hypothesis.

Figure 8: The constant rate hypothesis in practice (left) vs in theory (right).

(a) Appearance of the contemporary use of Ramist letters. (b) Disappearance of the etymological combination ‹ct›.

Figure 9: Computing the change-point of two changes of spelling.

(a) Slow decrease of the long s (‹ſ›). (b) Increase of the acute accent.

Figure 10: Long s vs acute accent.

Figure 11: Correlation ‹ſ›/acute accent (Pattern C: correlation). L. Biedermann-Pasques proposed the available type case as one of the parameters of spelling change: "the typographical use of the ligature has slowed down, in our opinion, the regular replacement of silent s by an accent" [7, p. 92].
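For reference, the Pearson product-moment coefficient behind the correlation shown in fig. 11 can be computed as in the sketch below. The two series are toy values used only to exercise the function, and the p-value reported in the paper is not reproduced here (it would typically be obtained from a statistics library).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sqrt(var_x) * sqrt(var_y))


# Toy per-text rates of two phenomena (not corpus data).
rate_a = [1.0, 2.0, 3.0, 4.0]
rate_b = [2.0, 4.0, 6.0, 8.0]
print(round(pearson(rate_a, rate_b), 6))
```

The coefficient ranges from -1 to 1, so a value such as 0.365 should be read together with its p-value and the number of texts involved rather than on its own.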

(a) Increase of the diaeresis. (b) Increase of the confusion ‹en|m›/‹an|m›.

Figure 12: Appearance of new phenomena.

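Table 2: Training and evaluation data for OCR.

| Dataset | Century | Language | Books | Lines |
|---|---|---|---|---|
| Train/Dev | 16 | French | 7 | 17817 |
| Train/Dev | 17 | French | 19 | 20267 |
| Train/Dev | 16 | Latin | 12 | 10648 |
| Test | 16 | French | 10 | |
| Test | 17 | French | 10 | |
| Test | 18 | French | 10 | |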

Table 3: Results of the two YOLO models on modern plays (columns: Theatrical corpus; Theatrical and LADaS corpus).

Table 4: Character and word error rates for both models.

| Models | Characters | Errors | CER | WER |
|---|---|---|---|---|
| No fine-tuning | 38394 | 924 | 2.41 | 11.06 |
| Fine-tuning | 38394 | 649 | 1.69 | 8.34 |

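Table 5: Character and word error rates for both models.

| % errors | CER (part) | Errors | Correct | Generated |
|---|---|---|---|---|
| 8.78% | 0.14% | 57 | SPACE | |
| 7.55% | 0.13% | 49 | ' | ' |
| 6.62% | 0.11% | 43 | s | ſ |
| 3.23% | 0.05% | 21 | - | Ø |
| 2.77% | 0.05% | 18 | ſ | f |
| 2.62% | 0.04% | 17 | ' | ' |
| 2.16% | 0.04% | 14 | Ø | SPACE |
| 2% | 0.03% | 13 | 1 | I |
| 2% | 0.03% | 13 | . | Ø |
| 1.85% | 0.03% | 12 | ◌́ | Ø |
| 1.69% | 0.03% | 11 | , | . |
| 1.54% | 0.03% | 10 | 0 | o |
| 1.54% | 0.03% | 10 | t | r |
| 1.54% | 0.03% | 10 | ◌̂ | Ø |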

<sp>
  <ab>
    <seg>
      <orig>SGANARELLE.</orig>
      <reg>SGANARELLE.</reg>
    </seg>
    <seg>
      <orig>Promettez-moy donc, Seigneur Geronimo, de me parler avec toute ſorte de franchiſe.</orig>
      <reg>Promettez-moi donc, Seigneur Geronimo, de me parler avec toute sorte de franchise.</reg>
    </seg>
  </ab>
</sp>
<sp>
  <ab>
    <seg>
      <orig>GERONIMO.</orig>
      <reg>GERONIMO.</reg>
    </seg>
    <seg>
      <orig>Ie vous le promets.</orig>
      <reg>Je vous le promets.</reg>
    </seg>
  </ab>
</sp>

Table 6: Prefix similarity matrix for the original and normalised version of ‹Apoſtre›.

1 In this article, we will distinguish between "spelling systems" and "orthography". To simplify, the first are coherent and competing logics of spelling words (as for the manuscripts with dialectal traits of the Middle Ages), the second is a strict norm which is recognised as a standard.

2 The cases that may pose a problem (e.g. lui-mesme → luimesme, eng. "himself") represent less than 0.1% of the corrected hyphenations.

3 https://huggingface.co/rbawden/modern_french_normalisation.

4 Some subtleties are brought to this adjustment, such as et and & which are considered equivalent.

5 "When one grammatical option replaces another with which it is in competition across a set of linguistic contexts, the rate of replacement, properly measured, is the same in all of them." [38]

Acknowledgments

Thanks (in alphabetical order) to Jean Barré, Alexandre Bartz, Rachel Bawden, Philippe Gambette and Benoît Sagot for their help. Thanks also to our reviewers for their excellent remarks.

Data and code

All the data and code are available on our GitHub repo: https://github.com/DEFI-COLaF/TheatreLFSV2.

Funding

This work was funded by the DEFI Inria COLaF (Corpus et Outils pour les Langues de France) and by the FNS-Spark project no. 220833.

References

[1] Académie française. Cahiers de remarques sur l'orthographe françoise pour estre examinez par chacun de Messieurs de l'Academie, avec des observations de Bossuet, Pellisson, etc. Paris: Jules Gay, 1863.
[2] A. Amatuzzi, W. Ayres-Bennett, A. Gerstenberg, L. Schøsler, C. Skupien-Dekens. Changement linguistique et périodisation du français (pré)classique: deux études de cas à partir des corpus du RCFC. Journal of French Language Studies 30(3), 2020. doi:10.1017/s0959269520000058.
[3] S. Baddeley. L'Ortographie française au temps de la Réforme. Genève: Droz, 1993.
[4] J. Bai. Estimating Multiple Breaks One at a Time. Econometric Theory 13(3), 1997. doi:10.1017/s0266466600005831.
[5] A. Basu. "Ill shapen sounds, and false orthography": A Computational Approach to Early English Orthographic Variation. In: Early Modern Studies: the Digital Turn. Ed. by L. Estill, D. K. Jakacki, M. Ullyot. Toronto: Iter Press, 2016.
[6] R. Bawden, J. Poinhos, E. Kogkitsidou, P. Gambette, B. Sagot, S. Gabay. Automatic Normalisation of Early Modern French. In: LREC 2022 - 13th Language Resources and Evaluation Conference. European Language Resources Association. Marseille, France, 2022.
[7] L. Biedermann-Pasques. Les Grands Courants orthographiques au XVIIe siècle et la formation de l'orthographe moderne, Impacts matériels, interférences phoniques, théories et pratiques. Tübingen: Max Niemeyer Verlag, 1992. doi:10.1515/9783110938593.
[8] M. Bollmann. Normalization of Historical Texts with Neural Network Models. PhD thesis. Bochum: Ruhr-Universität Bochum, 2018.
[9] J.-B. Camps. Manuscripts in Time and Space: Experiments in Scriptometrics on an Old French Corpus. In: Proceedings of the Second Workshop on Corpus-Based Research in the Humanities CRH-2. Ed. by A. U. Frank, C. Ivanovic, F. Mambrini, M. Passarotti, C. Sporleder. Vienna, Austria, 2018.
[10] N. Catach. Histoire de l'orthographe française. Paris: Honoré Champion, 2001.
[11] N. Catach, J. Golfand. L'orthographe plantinienne. De Gulden Passer 50, 1973.
[12] T. Clérice. You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine. Journal of Data Mining & Digital Humanities, 2023. doi:10.46298/jdmdh.9806.
[13] T. Clérice, J. Janès, H. Scheithauer, S. Bénière, L. Romary, B. Sagot. Layout Analysis Dataset with SegmOnto. In: DH2024 - Annual conference of the Alliance of Digital Humanities Organizations. Alliance of Digital Humanities Organizations (ADHO). Washington, D.C., United States, 2024.
[14] P. Corneille. Le Théâtre de P. Corneille. Paris: G. de Luyne, 1663. https://gallica.bnf.fr/ark:/12148/bpt6k71442p
[15] A. Dees. Dialectes et scriptae à l'époque de l'ancien français. Revue de Linguistique Romane 49, 1985.
[16] F. Duval. Les éditions de textes du XVIIe siècle. In: Manuel de la philologie de l'édition. Ed. by D. Trotter. Berlin; Boston: De Gruyter, 2015. doi:10.1515/9783110302608-017.
[17] H. Fix. Automatische Normalisierung - Vorarbeit zur Lemmatisierung eines diplomatischen altisländischen Textes. In: Teil 3 Beiträge zum dritten Symposion Tübingen 17 Februar 1977. Berlin/Boston: Max Niemeyer Verlag, 1980. doi:10.1515/9783111438788.92.
[18] J. Fruehwald, J. Gress-Wright, J. Wallenberg. Phonological Rule Change: The Constant Rate Effect. In: Proceedings of the 40th Annual Meeting of the North East Linguistic Society. Cambridge, MA: GLSA Publications, 2013.
[19] S. Gabay. Fondue-fr-print. 2024. doi:10.5281/zenodo.11526150.
[20] S. Gabay. Fondue-fr-print. 2024. doi:10.5281/zenodo.11526040.
[21] S. Gabay. FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus. Version 1.0. doi:10.5281/zenodo.5865428.
[22] S. Gabay. Pourquoi moderniser l'orthographe? Principes d'ecdotique et littérature du XVIIe siècle. Vox Romanica 73(1), 2014. doi: 99.125005/vox201410027.
[23] S. Gabay, L. Barrault. Traduction automatique pour la normalisation du français du XVIIe siècle. In: Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles, Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Vol. 2. Nancy, France, 2020.
[24] S. Gabay, R. Bawden, P. Gambette, J. Poinhos, E. Kogkitsidou, B. Sagot. Le changement linguistique au XVIIe s. : nouvelles approches scriptométriques. In: CMLF 2022 - 8e Congrès Mondial de Linguistique Française. Vol. 138. Orléans, France: EDP Sciences, 2022. doi:10.1051/shsconf/202213802006.
[25] S. Gabay, J.-B. Camps, A. Pinche, C. Jahan. SegmOnto: common vocabulary and practices for analysing the layout of manuscripts (and more). In: 1st International Workshop on Computational Paleography (IWCP@ICDAR 2021). Lausanne, Switzerland, 2021.
[26] S. Gabay, T. Clérice, P. Jacsont, E. Leblanc, M. Jeannot-Tirole, S. Solfrini, S. Dolto, F. Goy, C. C. Luján, M. Zaglio, M. Perregaux, J. Janès, B. Sagot, R. Bawden, R. Dent, O. Nédey, A. Chagué. Reconnaissance des écritures dans les imprimés. In: Humanistica 2024. Association francophone des humanités numériques. Meknès, Morocco, 2024.
[27] S. Gabay, T. Clérice, J. Janès. -TEST-longS. FONDUE-MLT-PRINT. 2024. doi:10.5281/zenodo.11526316.
[28] S. Gabay, P. Gambette, R. Bawden, B. Sagot. Ancien ou moderne ? Pistes computationnelles pour l'analyse graphématique des textes écrits au XVIIe siècle. Linx 85, 2023. doi:10.4000/linx.9346.
[29] S. Gabay, M. Jeannot-Tirole, F. Goy. Fondue. Vol. 16. 2024. doi:10.5281/zenodo.11526160.
[30] H. Goebl. Dialektometrie. In: Quantitative Linguistik/Quantitative Linguistics. Ein internationales Handbuch/An International Handbook. Berlin; New York: De Gruyter Mouton, 2005. doi:10.1515/9783110155785.
[31] R. L. Górski, M. Eder. Modelling the Dynamics of Language Change: Logistic Regression, Piotrowski's Law, and a Handful of Examples in Polish. Journal of Quantitative Linguistics 30(1), 2023. doi:10.1080/09296174.2022.2151208.
[32] L. Hou, D. Smith. Modeling the Decline in English Passivization. In: Proceedings of the Society for Computation in Linguistics (SCiL) 2018. Ed. by G. Jarosz, B. O'Connor, J. Pater. 2018. doi:10.7275/r5zc812c.
[33] J. Janès, A. Pinche, C. Jahan, S. Gabay. Towards automatic TEI encoding via layout analysis. In: Fantastic future 21, 3rd International Conference on Artificial Intelligence for Libraries, Archives and Museums. AI for Libraries, Archives, and Museums (AI4LAM).
[34] M. Khemakhem. Standard-based Lexical Models for Automatically Structured Dictionaries. PhD thesis. Paris: Université de Paris, 2020.
[35] M. Khemakhem, L. Foppiano, L. Romary. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. In: Electronic lexicography, eLex 2017. Leiden, The Netherlands, 2017.
[36] B. Kiessling. Kraken - an Universal Text Recognizer for the Humanities. In: Digital Humanities Conference 2019 - DH2019. Utrecht, The Netherlands: Alliance of Digital Humanities Organizations (ADHO), 2019. doi:10.34894/z9g2ex.
[37] C. Klaussner, C. Vogel, A. Bhattacharya. Detecting Linguistic Change Based on Word Co-occurrence Patterns. In: Proceedings of the 4th International Workshop on Computational History. Singapore, 2017.
[38] A. S. Kroch. Reflexes of grammar in patterns of language change. Language Variation and Change 1(3), 1989. doi:10.1017/s0954394500000168.
[39] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 1966.
[40] L. Muellner. The Free First Thousand Years of Greek. In: Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution. Berlin/Boston: De Gruyter Saur, 2019. doi:10.1515/9783110599572-002.
[41] S. Najem-Meyer, M. Romanello. Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches. In: Proceedings of the Computational Humanities Research Conference 2022. Antwerp, Belgium, 2022. doi:10.48550/arXiv.2212.13924.
[42] S. B. Needleman, C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48(3), 1970. doi:10.1016/0022-2836(70)90057-4.
[43] J. Nerbonne, W. Heeringa. Measuring dialect differences. In: Theories and Methods: An International Handbook of Linguistic Variation. Vol. 1. De Gruyter Mouton, 2010. doi:10.1515/9783110220278.550.
[44] Netherlands Historical Data Archive, Nijmegen Institute for Cognition & Information. Optical Character Recognition in the Historical Discipline. Proceedings of an International Workshop. St. Katharinen: Halbgraue Reihe zur Historischen Fachinformatik, 1993.
[45] A. Pinche, K. Christensen, S. Gabay. Between automatic and manual encoding. In: TEI 2022 conference: Text as data. Newcastle, United Kingdom, 2022. doi:10.5281/zenodo.7092214.
[46] J. Poinhos. ABA (Alignment-Based Approach). Version 1. 2020.
[47] D. Reis, J. Kupec, J. Hong, A. Daoudi. Real-Time Flying Object Detection with YOLOv8. 2023. doi:10.48550/arxiv.2305.09972.
[48] L. Remacle. Le Problème de l'ancien wallon. Liège: Presses universitaires de Liège, 1948.
[49] B. Robertson, F. Boschetti. Large-Scale Optical Character Recognition of Ancient Greek. Mouseion: Journal of the Classical Association of Canada 58(3), 2017.
[50] L. Romary, P. Lopez. GROBID - Information Extraction from Scientific Publications. ERCIM News 100 (Scientific Data Sharing and Re-use), 2015.
[51] U. Schäfer, B. Weitz. Combining OCR Outputs for Logical Document Structure Markup. Technical Background to the ACL 2012 Contributed Task. In: Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries. Ed. by R. E. Banchs. Jeju Island, Korea: Association for Computational Linguistics, 2012.
[52] H. Scheithauer, A. Chagué, L. Romary. Which TEI representation for the output of automatic transcriptions and their metadata? An illustrated proposition. 2022.
[53] J. Séguy. La dialectométrie dans l'Atlas linguistique de la Gascogne. Revue de linguistique romane 37, 1973.
[54] E. Tjong Kim Sang, M. Bollmann, R. Boschker, F. Casacuberta, F. Dietz, S. Dipper, M. Domingo, R. van der Goot, M. van Koppen, N. Ljubešić, R. Östling, F. Petran, E. Pettersson, Y. Scherrer, M. Schraagen, L. Sevens, J. Tiedemann, T. Vanallemeersch, K. Zervanou. The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation. Computational Linguistics in the Netherlands Journal 7, 2017.
[55] C. H. Vachon. Le Changement linguistique au XVIe siècle: une étude basée sur des textes littéraires français. Strasbourg: Éditions de linguistique et de philologie, 2010.
[56] C. F. D. Vaugelas. Remarques sur la langue françoise. Geneva: Droz, 2009.
[57] C. F. D. Vaugelas. Remarques sur la langue françoise, utiles à ceux qui veulent bien parler et bien escrire. Paris: Vve J. Camusat et P. Le Petit, 1647.
[58] L. Vostrikova. Detection of the disorder in multidimensional random-processes. Doklady Akademii Nauk SSSR 259(2), 1981.
[59] R. Zimmermann. An improved test of the constant rate hypothesis: late Modern American English possessive have. Corpus Linguistics and Linguistic Theory 19(3), 2023. doi:10.1515/cllt-2021-0038.