=Paper=
{{Paper
|id=Vol-3834/paper21
|storemode=property
|title=The birth of French orthography. A computational analysis of French spelling systems in diachrony
|pdfUrl=https://ceur-ws.org/Vol-3834/paper21.pdf
|volume=Vol-3834
|authors=Simon Gabay,Thibault Clérice
|dblpUrl=https://dblp.org/rec/conf/chr/GabayC24
}}
==The birth of French orthography. A computational analysis of French spelling systems in diachrony==
Simon Gabay* (Université de Genève), Thibault Clérice* (Inria Centre de Recherche de Paris)
Abstract
The 17th c. is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in French linguistics, for two reasons. On the one hand, spelling is made up of micro-changes which require a quantitative approach; on the other hand, no suitable corpus is available, because editors have intervened in almost all the texts already accessible. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR with a version automatically aligned with contemporary French spelling, we extract the variant zones, categorise these variants, and observe their frequency in order to study (ortho)graphic change during the 17th century.
Keywords
Computational linguistics, History of orthography, Information extraction, Corpus building
1. Introduction
The grapho-phonetic aspects of French during the 17th c. paradoxically remain very poorly known, despite the importance of the graphematic question at this period, which saw the appearance of French orthography.¹ Until now, research has concentrated mostly on the depth of the theoretical debates on spelling rather than on the actual practice of scriptors (e.g. [10] or [7]), and the notebooks of Mezeray [1] or the Remarques of Vaugelas [56, 57] still remain among the main sources used, rather than statistical surveys on vast corpora.
While the various dialects and other scriptae populating Old and Middle French have been abundantly described (e.g. in [15]), just like the "orthographie" of the Renaissance (to quote the term used by Baddeley [3]), the slow imposition of an orthographic norm throughout modern times, although a major phenomenon in the history of a language as prescriptive as French, remains a blind spot in diachronic linguistics. How did the French that we know today supplant its various early modern variants?
CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
* Corresponding author.
Email: simon.gabay@unige.ch (S. Gabay); thibault.clerice@inria.fr (T. Clérice)
Web: https://cv.hal.science/simon-gabay (S. Gabay); https://cv.hal.science/thibault-clerice (T. Clérice)
ORCID: 0000-0001-9094-4475 (S. Gabay); 0000-0003-1852-9204 (T. Clérice)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ In this article, we will distinguish between "spelling systems" and "orthography". To simplify, the former are coherent and competing logics for spelling words (as in the manuscripts with dialectal traits of the Middle Ages); the latter is a strict norm recognised as a standard.
One of the main technical challenges for such a study is the availability of large amounts of data, in order to guarantee the quantitative reliability of the results. Unfortunately, such corpora of classical French do not exist, for two reasons. On the one hand, as Cl. Vachon [55, p. 32, n. 31] bitterly experienced, text editors got into the habit of standardising the language of that era [22, 16], which makes existing editions particularly complicated, if not impossible, to use for graphematic studies. On the other hand, the few corpora that have been created, like that of Cl. Vachon, but also others like that of the Réseau Corpus Français Préclassique et Classique (RCFC) [2], are not available to researchers, or not in full. Given the ever-increasing quantities of data required by computational studies, these two corpora, even if they were freely accessible, would in any case probably remain insufficient for the most recent approaches proposed in NLP.
This paper proposes to return to the history of the French vêtement graphique ("graphic clothing") in a computational way. We introduce a two-step approach: first, a corpus-creation pipeline extracts spelling information from digital facsimiles. This pipeline includes a layout analysis model to distinguish text from paratext on the page, an OCR model that retains the historical character ‹ſ›, pivotal to written French, and a linguistic normaliser that "translates" historical French into its contemporary counterpart at the sentence level. In the second step, we analyse the created corpus using a comparison algorithm that matches the extracted historical text with its modern equivalent at the character level. This enables us to pinpoint significant variations, categorise these differences, and uncover detailed trends throughout the 17th century. This methodological framework not only enhances our understanding of historical French orthography, but also proposes a new approach for computational linguistic studies of spelling variation.
2. State of the art
Corpus building from OCR has long been a task in digital humanities and corpus linguistics. Initially deemed unsuitable for historical sources in 1993 [44], OCR gained credibility in the late 1990s for corpus building, including XML TEI formalisation in commercial projects such as the Patrologia Latina Database, and for Ancient Greek scripts in the 2010s [49]. Most projects using TEI, such as the First1KGreek project, relied on manual formalisation of the text's logical structure [40], as manual work was considered essential for accuracy. The advent of user-friendly OCR and HTR technologies has spurred interest in automatic document formalisation (ADF), primarily focused on facsimile formalisation [52] and noisy text removal with tools based on vocabularies such as SegmOnto [25], which standardises the identification of paratextual zones (running titles, footnotes, etc.). Few projects, however, have utilised font, geometric, and textual features to reconstruct or emulate the original text structure from born-digital PDFs or OCR outputs. PaperXML [51] demonstrated such a transformation but was limited to the ACL Anthology structure. Grobid [50] and Grobid Dictionaries [35, 34] employed geometric, font, and textual features to produce XML TEI output, though they were specific to scientific papers and dictionaries. In 2022, visual features outperformed linguistic ones in document formalisation, with YOLO models using the SegmOnto controlled vocabulary surpassing LayoutLM models in multilingual settings [41]. Recently, research has started on OCR output formalisation for corpus building, with a controlled vocabulary and a training dataset for models [33, 45]. Lastly, the Layout Analysis Dataset with SegmOnto (LADaS) [13] allowed a much finer granularity in the analysis, and a significant improvement of the entire pipeline for the automatic creation of files encoded in XML-TEI, going beyond a facsimile approach and closer to reproducing the logical structure of the text.
Linguistic Normalisation (LN) has a long history, dating back to the 1980s [17], but developed as a task derived from Machine Translation (MT) at the beginning of the 2010s, usually to improve downstream tasks in the pipeline such as linguistic annotation [54]. LN shares important similarities with MT, and therefore relies on the same methods, but with a slightly different objective: to "translate" a source into another state of the same language, usually a more recent one (16th c. German → contemporary German), rather than into another language (Italian → German). Resources existed first for Slovene, German, English, Hungarian, Spanish, Swedish and Portuguese [8], but several studies have recently improved both resources [21] and techniques for historical French, first comparing rule-based, statistical and neural methods [23], and then alignment-based and neural MT approaches [6].
Computational scriptology is based on the notion of scripta, coined by Remacle [48] and widely used in Romance studies to distinguish a spoken language (the dialect) from a written language (the scripta). The first studies in dialectometry date back to the early 1970s with the pioneering work of Jean Séguy, who coined the term dialectométrie [53], on the distance between dialects in vast corpora [48]. Since then, two main schools, based in Salzburg [30] and Groningen [43], have advanced research on the topic, but they rely mainly on geographical data to localise dialects. In parallel to this research, Cl. Vachon changed the approach, switching to corpus-based research and using historical data to study spelling semi-automatically [55], and more recently, J.-B. Camps shifted the method, using unsupervised stylometry to categorise medieval scriptae [9]. Regarding modern French, alternative studies have proposed alignment-based approaches that compare the historical source with an automatically normalised version to detect the evolution of spellings [24] or to categorise documents [28].
3. Corpus building
3.1. Data
For practical reasons, a first corpus of limited size (c. 600 texts) spanning the 17th c. was produced with our pipeline. The data comes from the Gallica digital library and contains only French-language documents. For our experiment, we selected only plays, which offer medium-sized documents (compared to novels, which are potentially much longer) and linguistically homogeneous data (spelling can be influenced by genre: legal documents, for instance, tend to use more "archaic" traits and may involve Latin phrases).
Figure 1: Data production pipeline.
3.2. Method
Our pipeline allows us to extract data, enrich it and store it in a standard format (cf. fig. 1). Firstly, we apply a layout analysis model specialised in theatrical data and trained for the occasion; then we use an OCR model prepared for this study which preserves the long s (‹ſ›). Based on the layout analysis, we convert ALTO files to TEI files. Only textual data containing the text of the work (paragraphs, speeches, verses, etc.), and not data linked to the structure of the book (running titles, page numbers, quire marks, etc.), is extracted and normalised automatically, before being reintroduced into the TEI file.
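To make the sequence of steps concrete, the following Python sketch outlines the orchestration of the pipeline. The helper functions (segment_page, ocr_lines, alto_to_tei, normalise) and the module they come from are hypothetical placeholders standing in for the YALTAi/YOLO segmenter, the Kraken OCR model, the ALTO-to-TEI converter and the linguistic normaliser; they are not the actual APIs of those tools.

```python
from pathlib import Path

# Hypothetical wrappers around the real components of the pipeline:
# - segment_page: layout analysis (YALTAi/YOLO), returns zones with SegmOnto labels
# - ocr_lines: Kraken OCR with the long-s model, returns ALTO for one page
# - alto_to_tei: converts the ALTO pages of one document into a single TEI file
# - normalise: sentence-level linguistic normaliser (historical -> contemporary French)
from mypipeline import segment_page, ocr_lines, alto_to_tei, normalise  # hypothetical module


def process_document(image_dir: Path, out_path: Path) -> None:
    """Run layout analysis, OCR, TEI conversion and normalisation for one book."""
    alto_pages = []
    for image in sorted(image_dir.glob("*.jpg")):
        zones = segment_page(image)                  # MainZone, RunningTitleZone, ...
        alto_pages.append(ocr_lines(image, zones))   # OCR restricted to the detected zones

    tei = alto_to_tei(alto_pages)                    # keep MainZone* regions, drop paratext

    # Normalise only the textual content of the work, then reinject it into the TEI file.
    for seg in tei.segments():                       # hypothetical iterator over <seg> elements
        seg.reg = normalise(seg.orig)

    tei.write(out_path)


if __name__ == "__main__":
    process_document(Path("gallica/btv1b0000000"), Path("out/btv1b0000000.xml"))
```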
Layout analysis. Based on the results of Najem-Meyer and Romanello [41] and the initial evaluation of a YOLO region segmenter against Kraken's [36] with YALTAi [12], we proposed to evaluate the ability of YOLOv8 [47] to detect regions in our 17th c. print corpus. For this purpose, we annotated one random image from each digitised version of our corpus, which could include empty pages (e.g., bookbinding, cover) as well as full pages. This resulted in a corpus of 620 images for training, evaluation, and testing. Our final corpus comprises 32 null pages (without annotations) and a variety of annotations, with a majority of speech-related tags (MainZone:SP, MainZone:SP#Continued), paratextual objects (e.g., NumberingZone, RunningTitleZone), a smaller number of logical structuring features such as scene titles (MainZone:Head) and cast lists (MainZone:Entry), as well as a few paragraphs and poetic excerpts, mainly found in incipits or prefaces of the books (MainZone:P), as seen in tab. 1.

Table 1: Training and evaluation data for Layout Analysis across the datasets. Each image represents a single document in the dataset.

                                     Train   Dev   Test
Images                                 497    61     62
MainZone-Sp                           1738   219    187
NumberingZone                          384    49     45
RunningTitleZone                       373    43     41
DigitizationArtefactZone               189    23     24
QuireMarksZone                         183    29     15
MainZone-Head                          159    28     33
MainZone-Sp-Continued                  154    16     18
DropCapitalZone                        136    23     23
GraphicZone-Decoration                 130    16     22
MainZone-Entry                          89     6     19
MainZone-Lg                             58     5      1
MainZone-P                              41    10     12
MainZone-P-Continued                    28     1      2
MarginTextZone-ManuscriptAddendum       21     3      7
MarginTextZone-Notes                    30     1      0
StampZone-Sticker                       24     1      2
MainZone-Other                          19     3      2
StampZone                               13     3      5
TitlePageZone                            8     2      4
GraphicZone                              5     0      1
MainZone-Incipit                         4     1      0
MainZone-Signature                       1     0      0
Figure 2: Three page examples with zone objects ((a) groundtruth, (b) prediction, (c) prediction).
Optical character recognition. Since YOLO is well integrated within YALTAi, which in turn works seamlessly with Kraken, we decided to use the latter to train a new OCR model that includes the long s (‹ſ›). Kraken, unlike other OCR systems, avoids the integration of a strong language model, which in turn, for our purpose, allows more variation to be kept. This new model, derived from CATMuS Print [26], uses three datasets for fine-tuning [19, 20, 29] and one evaluation dataset [27] (cf. tab. 2). We evaluate on a test set that includes data spanning three centuries (from the 16th to the 18th) and comprises one page from 10 different documents for each period.

Table 2: Training and evaluation data for OCR.

Dataset     Century   Language   Books   Lines
Train/Dev   16        French     7       17817
Train/Dev   17        French     19      20267
Train/Dev   16        Latin      12      10648
Test        16        French     10
Test        17        French     10
Test        18        French     10
TEI Document production. Document formalisation follows a logical approach based on the ALTO output produced by Kraken and YALTAi, rather than a neural one. Each region is processed in reading order, with regions not matching MainZone being ignored, except for the "default" region, which handles orphan lines. The default region is placed into a <fw> ("forme work") tag, which is typically excluded from our text export processes. Regions marked as #Continued are logically merged with the previous ones. Each line is prepended with a TEI <lb/> (line beginning) tag to facilitate back-to-document correction capabilities. Hyphenation is resolved by removing the hyphen but keeping the <lb/> tag in its place.² While machine learning is employed for initial region detection, the formalisation process itself does not involve any learned behaviour. Metadata are systematically integrated in the <teiHeader>, using information automatically retrieved from the catalogue of the French National Library via the ark ID.

² The cases that may pose a problem (e.g. lui-mesme → luimesme, eng. "himself") represent less than 0.1% of the corrected hyphenations.
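As an illustration of the hyphenation handling, here is a minimal sketch, assuming lines are already plain strings extracted from the ALTO output; the <lb/> marker is kept as a textual placeholder so that the dehyphenated text can still be traced back to the original line breaks. The function name and the string-based representation are ours, not the pipeline's actual implementation.

```python
def merge_hyphenated_lines(lines: list[str]) -> str:
    """Join OCR lines, resolving end-of-line hyphenation.

    A line ending in "-" is joined to the next one without a space,
    the hyphen is dropped, and an <lb/> marker records where the
    original line break was (so corrections can be mapped back).
    """
    out = []
    for line in lines:
        line = line.rstrip()
        if line.endswith("-"):
            out.append(line[:-1] + "<lb/>")   # drop hyphen, keep the break marker
        else:
            out.append(line + "<lb/> ")       # plain break: marker plus a space
    return "".join(out).strip()


print(merge_hyphenated_lines(["Promettez-moy donc, Seigneur Ge-", "ronimo, de me parler"]))
# Promettez-moy donc, Seigneur Ge<lb/>ronimo, de me parler<lb/>
```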
Linguistic normalisation. All documents are processed with a previously trained normaliser.³ Only text contained in <p> and <sp> ("speech") elements is kept for normalisation, because a specific spelling variation occurring in the running title, for instance, would be repeated every two pages and potentially alter the result of the scriptometric analysis artificially. The text is split into sentences (ending with a full stop, an exclamation mark or a question mark) or subsentences (ending with a colon or a semicolon), all stored in a <seg> ("arbitrary segment") element, with the source text in <orig> ("original form") and the automatically normalised text in <reg> ("regularization"). The normalised version is evaluated against a dictionary of modern French to control the quality of the final product.
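The splitting into sentences and subsentences can be approximated with a simple rule, as in the sketch below; the regular expression and the function are illustrative assumptions, not the exact segmenter used in the pipeline.

```python
import re

# Split on sentence-final punctuation (. ! ?) and on "subsentence" boundaries (: ;),
# keeping the punctuation attached to the preceding segment.
SEG_BOUNDARY = re.compile(r"(?<=[.!?:;])\s+")


def split_segments(text: str) -> list[str]:
    """Return the <seg> units of a speech: sentences and subsentences."""
    return [seg.strip() for seg in SEG_BOUNDARY.split(text) if seg.strip()]


speech = "Ie vous le promets. Promettez-moy donc, Seigneur Geronimo, de me parler : i'attends."
print(split_segments(speech))
# ['Ie vous le promets.', 'Promettez-moy donc, Seigneur Geronimo, de me parler :', "i'attends."]
```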
3.3. Experimental Setup and Evaluation
Layout analysis. We evaluate two possible setups, both fine-tuning the original YOLOv8L model with an input image size of 960 pixels (higher than the default). One setup uses only the dataset produced in the context of this paper, while the other merges this dataset with the larger LADaS dataset (5,000 images). We train both setups for 100 epochs with otherwise default parameters.
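A minimal sketch of such a fine-tuning run with the ultralytics package is given below; the dataset configuration file (plays_segmonto.yaml) is a hypothetical name for a YOLO dataset description listing the SegmOnto classes, and it is not provided with the paper.

```python
from ultralytics import YOLO

# Fine-tune the pretrained YOLOv8-Large checkpoint on the layout-analysis dataset.
# "plays_segmonto.yaml" is a hypothetical dataset file pointing to the annotated
# page images and listing the SegmOnto zone classes (MainZone-Sp, NumberingZone, ...).
model = YOLO("yolov8l.pt")
model.train(
    data="plays_segmonto.yaml",
    epochs=100,   # as in the paper
    imgsz=960,    # input image size, higher than the YOLO default of 640
)

# Evaluate on the held-out test split and report mAP@50 over all classes.
metrics = model.val(split="test")
print(metrics.box.map50)
```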
Table 3: Results of the two YOLO models on modern plays (box precision P, recall R, mAP50 and mAP50-95).

                        Theatrical corpus                      Theatrical and LADaS corpus
Class                  Images  Instances  P      R      mAP50  mAP50-95   P       R      mAP50  mAP50-95
all                      62      463      0.824  0.705  0.768  0.626      0.739   0.738  0.8    0.666
MainZone-Entry            2       19      0.456  0.048  0.463  0.244      0.857   0.316  0.76   0.486
MainZone-Head            24       33      0.915  0.697  0.825  0.698      0.854   0.532  0.722  0.587
MainZone-Lg               1        1      0.74   1      0.995  0.895      0.807   1      0.995  0.895
MainZone-Other            2        2      1      0      0.174  0.139      0.0427  0.107  0.105  0.0732
MainZone-P                5       12      0.549  0.711  0.66   0.474      0.655   0.25   0.55   0.49
MainZone-P-Continued      2        2      0.92   1      0.995  0.946      0.385   1      0.995  0.995
MainZone-Sp              41      187      0.967  0.979  0.988  0.941      0.955   0.973  0.982  0.924
MainZone-Sp-Continued    18       18      1      0.891  0.995  0.93       0.978   1      0.955  0.969
Since our study focuses exclusively on the MainZone classes, which contain the primary text and exclude all paratextual elements (such as decorations, page numbers, and running titles), we have concentrated our evaluation on this specific zone. Overall, when considering all classes, integrating our data with the LADaS corpus yields improved results (mAP50 of 0.8 vs. 0.768). However, for the most critical classes (Sp and Sp-Continued), the model trained exclusively on theatrical data produces slightly better outcomes. As previously mentioned, these are the classes essential for our study.
³ https://huggingface.co/rbawden/modern_french_normalisation.
Text recognition. To fine-tune and adapt the CATMuS Print OCR model to the allographic variation of the round s/long s, we modified the classifier codec (--resize new mode) and used a standard learning rate of 0.0001, along with a batch size of 32. This approach ensures the model is fine-tuned to the specific typographic variations without relying on any learned behaviour during the formalisation process. We compare this approach to a model without fine-tuning, trained from scratch with the same architecture (cf. tab. 4), revealing the superiority of the fine-tuning approach.

Table 4: Character and word error rates for both models.

Models           Characters   Errors   CER    WER
No fine-tuning   38394        924      2.41   11.06
Fine-tuning      38394        649      1.69   8.34

Most of the errors are related to poor segmentation of the text (cf. tab. 5), in which a space that should be present is missing from the prediction – a classic error for historical prints. The prediction errors regarding the two types of apostrophes (curved or straight) are of little concern, because they do not affect the result from a linguistic point of view and are due to poor data preparation that is easily correctable. The confusion between the round s and the long s is likely attributable to the fine-tuning process and the absence of the long s in the base model.

Table 5: Most frequent error types of the fine-tuned model (Ø marks an absent character).

% errors   CER (part)   Errors   Correct   Generated
8.78%      0.14%        57       SPACE     Ø
7.55%      0.13%        49       '         ’
6.62%      0.11%        43       s         ſ
3.23%      0.05%        21       –         Ø
2.77%      0.05%        18       ſ         f
2.62%      0.04%        17       ’         '
2.16%      0.04%        14       Ø         SPACE
2%         0.03%        13       1         I
2%         0.03%        13       .         Ø
1.85%      0.03%        12       ◌́         Ø
1.69%      0.03%        11       ,         .
1.54%      0.03%        10       0         o
1.54%      0.03%        10       t         r
1.54%      0.03%        10       ◌̂         Ø
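For reference, character and word error rates of the kind reported in tab. 4 can be computed as normalised edit distances; the sketch below is a generic implementation, not the exact evaluation script used for the paper.

```python
def levenshtein(ref: list, hyp: list) -> int:
    """Minimal edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over the number of reference characters."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over the number of reference words."""
    return levenshtein(reference.split(), hypothesis.split()) / len(reference.split())


print(cer("toute ſorte de franchiſe", "toute sorte de franchise"))  # 2 errors / 24 characters
print(wer("toute ſorte de franchiſe", "toute sorte de franchise"))  # 2 errors / 4 words
```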
Linguistic normalisation. To evaluate the results of the normalisation, we compare the prediction of the normaliser with a dictionary of contemporary French to obtain a Word Accuracy (WAcc). Results are satisfactory (cf. fig. 3), with a median above 90%. Texts with a WAcc under 80% are removed to avoid using unreliable data.
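A minimal sketch of this dictionary-based check is given below; the tokenisation and the word list are illustrative assumptions, not the exact resources used in the paper.

```python
import re


def word_accuracy(normalised_text: str, lexicon: set[str]) -> float:
    """Share of normalised tokens found in a contemporary French word list."""
    tokens = re.findall(r"[^\W\d_]+", normalised_text.lower())
    if not tokens:
        return 0.0
    known = sum(token in lexicon for token in tokens)
    return known / len(tokens)


# Hypothetical lexicon of contemporary French forms (normally loaded from a word list).
lexicon = {"je", "vous", "le", "promets"}
print(word_accuracy("Je vous le promets.", lexicon))  # 1.0
```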
Figure 3: Word error rate for the corpus.

3.4. Result dataset

The final dataset is made of around 80,000 pages for 620 documents. While the number of bibliographical units is uneven over the years (cf. fig. 5a), the accumulated tokens progress evenly (cf. fig. 5b). An example of our TEI encoding is presented in fig. 4.
Figure 4: Example of TEI encoding with normalisation. The figure shows two speeches with the original and normalised text of each segment: SGANARELLE: "Promettez-moy donc, Seigneur Geronimo, de me parler avec toute ſorte de franchiſe." / "Promettez-moi donc, Seigneur Geronimo, de me parler avec toute sorte de franchise."; GERONIMO: "Ie vous le promets." / "Je vous le promets."
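The encoding shown in fig. 4 can be produced programmatically. The sketch below builds one such speech with lxml, assuming the standard TEI elements named earlier (<sp>, <speaker>, <seg>, <orig>, <reg>) plus a <choice> wrapper, which is our assumption; it illustrates the structure only and is not the paper's actual serialisation code.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_speech(speaker: str, orig: str, reg: str):
    """Build an <sp> element holding one <seg> with its original and regularised forms."""
    sp = etree.Element(f"{{{TEI_NS}}}sp")
    etree.SubElement(sp, f"{{{TEI_NS}}}speaker").text = speaker
    seg = etree.SubElement(sp, f"{{{TEI_NS}}}seg")
    choice = etree.SubElement(seg, f"{{{TEI_NS}}}choice")
    etree.SubElement(choice, f"{{{TEI_NS}}}orig").text = orig
    etree.SubElement(choice, f"{{{TEI_NS}}}reg").text = reg
    return sp


sp = build_speech(
    "SGANARELLE.",
    "Promettez-moy donc, Seigneur Geronimo, de me parler avec toute ſorte de franchiſe.",
    "Promettez-moi donc, Seigneur Geronimo, de me parler avec toute sorte de franchise.",
)
print(etree.tostring(sp, pretty_print=True, encoding="unicode"))
```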
Figure 5: Description of the OCRised corpus. (a) Number of bibliographical units per year; the bin around 1720 represents printed books from within the 17th century but with unclear or imprecise printing dates. (b) Accumulated tokens over the years; the year of printing is used for the date, and any document without a precise date is removed from the plot. Tokens are taken from the original OCR documents, only from the MainZones.
4. Evaluation of spelling variation
4.1. Method
We use the ABA [46] tool to precisely identify the portions of words which differ between the original version and the normalised version, and to group similar differences, for example those having the same historical-linguistic origin, or the same type of operation in terms of addition, deletion or modification of characters. Each <orig> and <reg> of the corpus is split into words and the punctuation is removed; the original and normalised versions are then aligned at the word level using the Needleman-Wunsch [42] algorithm, based on the Levenshtein distance [39] between each pair of words in the same <seg> of the original and normalised version.⁴

Secondly, for each of the aligned word pairs, the original version and the normalised version are aligned at the character level, still using the Needleman-Wunsch algorithm, but with a specific substitution matrix that allows not only identical letters to be aligned, but also letters considered close in (pre)classical French and contemporary French (presence/absence of a diacritic, ligatures…). For example, while identical letters benefit from a substitution score of 4, letters differing only in accent or cedilla benefit from a score of 2, as do ‹ſ› and ‹s› or ‹s› and ‹ß›. Other pairs of letters benefit from a score of 1, such as ‹u› and ‹v›, ‹s› and ‹z› or even ‹n› and ‹m›. Conversely, a score of -1 is assigned to pairs of distinct letters not subject to such exceptions, as well as to the deletion or insertion of a character.
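The character-level alignment can be reproduced with a standard Needleman-Wunsch dynamic program using the scores just described; the sketch below is our own illustrative implementation (only a few of the letter-pair exceptions are listed), not the ABA tool itself.

```python
# Scores described in the text: 4 for identical letters, 2 or 1 for letters considered
# close in (pre)classical vs. contemporary French, -1 for other substitutions and for gaps.
CLOSE_2 = {("ſ", "s"), ("s", "ß"), ("é", "e"), ("ç", "c"), ("ô", "o")}   # partial list
CLOSE_1 = {("u", "v"), ("s", "z"), ("n", "m")}                           # partial list
GAP = -1


def score(a: str, b: str) -> int:
    if a == b:
        return 4
    if (a, b) in CLOSE_2 or (b, a) in CLOSE_2:
        return 2
    if (a, b) in CLOSE_1 or (b, a) in CLOSE_1:
        return 1
    return -1


def align(orig: str, reg: str) -> tuple[str, str]:
    """Needleman-Wunsch alignment; gaps are rendered with the ¤ character."""
    n, m = len(orig), len(reg)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * GAP
    for j in range(1, m + 1):
        D[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + score(orig[i - 1], reg[j - 1]),
                          D[i - 1][j] + GAP,
                          D[i][j - 1] + GAP)
    # Traceback from the bottom-right corner.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + score(orig[i - 1], reg[j - 1]):
            a.append(orig[i - 1]); b.append(reg[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + GAP:
            a.append(orig[i - 1]); b.append("¤"); i -= 1
        else:
            a.append("¤"); b.append(reg[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


print(align("Apoſtre", "Apôtre"))  # ('Apoſtre', 'Apô¤tre')
```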
This execution of the Needleman-Wunsch algorithm to obtain character-level alignment is illustrated in the matrix in tab. 6, where each number represents the similarity score of the best alignment found between a prefix of ‹Apoſtre› and a prefix of ‹Apôtre› up to this box. It is preceded by an arrow indicating from which box to come to obtain this best alignment. For example, to obtain the best alignment between ‹Apoſ› and ‹Apô›, we must consider the best alignment between ‹Apo› and ‹Apô› (which has a score of 10), then make an insertion of ‹ſ›, which has a score of -1, which provides a total score of 9. If we had preferred to first consider the best alignment between ‹Apoſ› and ‹Ap›, which has a score of 6, then delete the ô, which has a score of -1, we would have obtained an alignment with a score of 5, therefore lower than the optimal one. In case of insertion or deletion during this alignment step, we use the ¤ character in order to obtain two words of the same length in both the original and normalised version. Thus, at the end of this second alignment step, the word Apoſtre in the original version is matched with Apô¤tre in the normalised version to obtain a character-by-character alignment.

Table 6: Prefix similarity matrix for the original and normalised version of ‹Apoſtre›. The arrows indicate the previous box on the optimal path to calculate the similarity between two prefixes, one from the word on the first row, the other from the word in the first column. On this optimal path, green indicates equality, red indicates substitution, and blue indicates deletion.

       A      p      o      ſ      t      r      e
A    ↘ 4    → 3    → 2    → 1    → 0    → -1   → -2
p    ↓ 3    ↘ 8    → 7    → 6    → 5    → 4    → 3
ô    ↓ 2    ↓ 7    ↘ 10   → 9    → 8    → 7    → 6
t    ↓ 1    ↓ 6    ↓ 9    ↓ 8    ↘ 13   → 12   → 11
r    ↓ 0    ↓ 5    ↓ 8    ↓ 7    ↓ 12   ↘ 17   → 16
e    ↓ -1   ↓ 4    ↓ 7    ↓ 6    ↓ 11   ↓ 16   ↘ 21
Finally, for each word in the corpus, its original and normalised versions are analysed character by character, to detect, when the characters at the same position differ, the normalisation rule that applies, or to signal that no existing rule was identified when appropriate. 72 rules were defined based on the bibliography and on the differences observed in the gold FreEMnorm parallel corpus [21]. For example, the rule Ramist letter is detected if an ‹i›, a ‹j›, an ‹u› or a ‹v› in the original word corresponds respectively to a ‹j›, an ‹i›, a ‹v› or an ‹u› in the normalised version.

⁴ Some subtleties are added to this alignment, such as et and &, which are considered equivalent.
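As an illustration of this rule-detection step, the sketch below checks a couple of rules on a pair of character-aligned words; the rule names and the two rules shown are simplified stand-ins for the 72 rules actually used.

```python
def detect_rules(orig_aligned: str, reg_aligned: str) -> list[tuple[int, str]]:
    """Compare two character-aligned words and name the rule behind each difference."""
    ramist = {("i", "j"), ("j", "i"), ("u", "v"), ("v", "u")}
    rules = []
    for pos, (o, r) in enumerate(zip(orig_aligned, reg_aligned)):
        if o == r:
            continue
        if (o.lower(), r.lower()) in ramist:
            rules.append((pos, "Ramist letter"))
        elif o == "ſ" and r == "s":
            rules.append((pos, "Long s"))
        else:
            rules.append((pos, "unidentified"))
    return rules


# 'iuſte' / 'juste' shows a Ramist letter (i -> j) and a long s (ſ -> s).
print(detect_rules("iuſte", "juste"))
# [(0, 'Ramist letter'), (2, 'Long s')]
```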
4.2. Results
Based on the alignments obtained using the Needleman-Wunsch algorithm and the detections of the 72 rules mentioned earlier, our analysis reveals four distinctive patterns of historical spelling changes. The principle underlying this analysis is straightforward: if a normalisation rule is detected less frequently, it indicates that the historical spelling it targets is becoming less prevalent in the corpus. To examine its evolution throughout the century, we normalise the total number of rule applications to its percentage within each text.

Figure 6: Disappearance of ‹gn›.

For instance, the etymological spelling ‹gn›, found in the form cognoitre (