=Paper=
{{Paper
|id=Vol-3290/long_paper5740
|storemode=property
|title=Detecting Formulaic Language Use in Historical Administrative Corpora
|pdfUrl=https://ceur-ws.org/Vol-3290/long_paper5740.pdf
|volume=Vol-3290
|authors=Marijn Koolen,Rik Hoekstra
|dblpUrl=https://dblp.org/rec/conf/chr/KoolenH22
}}
==Detecting Formulaic Language Use in Historical Administrative Corpora==
Detecting Formulaic Language Use in Historical Administrative Corpora Marijn Koolen1,2 , Rik Hoekstra1,2 1 KNAW Huygens Institute, Amsterdam, the Netherlands 2 DHLab, KNAW Humanities Cluster, Amsterdam, the Netherlands Abstract Historical administrative corpora are 昀椀lled with jargon and formulaic expressions that were used con- sistently across many documents. Governmental decisions, notarial deeds and o昀케cial charters o昀琀en contain 昀椀xed expressions to ensure that the same legal aspects in di昀昀erent documents had the same interpretation. Such formulaic expressions can be used to identify speci昀椀c elements of a document. For instance, a deed has di昀昀erent formulas to indicate whether it concerns the sale of property or the transferal of rights. In this paper we explore formulas as a methodological devise to structure the text of an administrative corpus and make the information contained in it better accessible. We use a data- driven method to detect potential formulaic expressions in historical corpora, that can deal with spelling variation and change and recognition errors introduced in the digitisation process. We apply this ex- ploratory technique on a corpus of almost 300,000 eighteenth-century resolutions of the States General of the Dutch Republic and 昀椀nd many formulaic expressions that capture relationships between the polit- ical actors involved and the decisions that were made. A 昀椀rst analysis suggests that many formulas can be used to add metadata to individual resolutions on various elements of the proposals and decisions that are part of each resolution. Keywords formulaic expressions, text reuse, document structure, information extraction, text analysis 1. Introduction The Resolutions of the States General of the Dutch Republic (1576-1796) is a digitised archive containing an estimated 1 million decisions made by the States General (SG) during their daily meetings. It is 昀椀lled with administrative jargon and formulaic expressions that were used con- sistently, tens of thousands of times across a 220 year period, in resolutions with a very 昀椀xed structure. These formulaic expressions were used to signal speci昀椀c elements in the text, so that anyone relying on the resolutions for their day-to-day work could easily 昀椀nd back requests, de- cisions and agreements by looking for these 昀椀xed phrasings, which also made sure that similar decisions and agreements had similar interpretations. In this paper we explore formulas as a methodological device to structure the text of the archive and make the information contained in it better accessible. The secretaries of the meet- ings used a 昀椀xed structure and 昀椀xed expressions to signal the opening of a new resolution, CHR 2022: Computational Humanities Research Conference, December 12 – 14, 2022, Antwerp, Belgium £ marijn.koolen@gmail.com (M. Koolen); rik.hoekstra@di.huc.knaw.nl (R. Hoekstra) ç https://marijnkoolen.com/ (M. Koolen) ȉ 0000-0002-0301-2029 (M. Koolen); 0000-0002-6951-8014 (R. Hoekstra) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 127 Figure 1: The formulaic expression ‘Waer op gedelibereert zijnde, is goetgevonden ende verstaen’ used in four resolutions taken from a single meeting on the 22nd of November 1709. that always started with a proposal submitted to the SG. Each resolution ends with a decision paragraph, which also starts with a formulaic expression, followed by the details of what was agreed upon and what should happen next. For instance, to signal the SG reached an agree- ment on what should be done in response to a proposition, they used the formula ‘Waer op gedelibereert zijnde, is goetgevonden ende verstaen, ...’ (EN: On which has been agreed and understood .... A number of examples of this phrase are shown in Figure 1). This phrase recurs tens of thousands of times in the resolutions and signals that an agreement and decision were reached that are detailed in the following paragraph. The formulas thus not only help us structure the material, but also to add metadata to the individual resolutions and make them better accessible for analysis. Our experience in working with other historical collections prompted a set of questions: Are such formulaic expressions used in other administrative corpora? And what other domains and document genres contain formulaic expressions? From an information access perspective, is the textual repetition of an administrative corpus like the Resolutions di昀昀erent from textual repetition in corpora in other domains? To get a better idea of the relevance of formulas, we 昀椀rst compare repetitive text character- istics in corpora of di昀昀erent domains, to establish whether administrative texts like the reso- lutions are of a di昀昀erent quality from other types of text. The second topic of this paper is the methodology we use for algorithmically identifying potential formulas in the resolutions, 128 and a discussion of their use. In this paper we con昀椀ne ourselves to these points, but we realise that formulas have relevance for a number of wider humanities research questions. We discuss these at the end in Section 6. 2. Related Work Our work touches on three strands of research: 1) formulaic language use, 2) text reuse detec- tion and 3) dealing with variation-rich text. 2.1. Formulaic Language The use of formulaic expressions is mostly studied in the 昀椀elds of linguistics [47, 26] and lan- guage learning [14, 7, 42, 13]. Formulaic language is typically de昀椀ned as 昀椀xed word combina- tions, with o昀琀en non-literal meanings, that are used to improve 昀氀uency and reduce misunder- standing [47, 41, 46]. Poß and Wouden studied formulaic expressions consisting of anything more than one word as Extended Lexical Units, stored as single a entry in a speaker’s mental lexicon. We found two studies that investigated the use of formulas in text corpora. Karsdorp iden- ti昀椀ed and classi昀椀ed formulaic opening and closing expressions of Dutch folk tales. Repetitive patterns in the 昀椀rst and last 昀椀ve words of folk tales are detected and are found to be predictive of the genre of a folk tale. That is, the opening formula o昀琀en signals that something is a joke, saga or fairy tale. In the resolutions, the opening formulas are similarly indicative of what kind of proposal (petition, report, declaration, etc.) is discussed in the resolution. [37] manually an- notated the use of formulaic expressions in seventeenth and eighteenth century Dutch letters and found that more experienced writers tend to use more 昀椀xed expressions. They suggest that this indicates that formulas are partly used to reduce cognitive e昀昀ort. 2.2. Text Reuse Detection Text reuse detection has been studied extensively in the context of plagiarism detection. The annual PAN competitions, starting in 2009, have been a main driver for developing algorithms for plagiarism detection and text reuse detection [29, 27, 28, 43, 24]. Most research on text reuse detection focuses on modern texts, o昀琀en digital born and using modern language, which has rules for spelling and syntax. Detecting text reuse becomes more complicated for long serial archives covering historic documents from an extensive historical period in which the language used had no consistent spelling, spelling changed over time, and the digitisation of those documents introduces text recognition errors [45, 44]. In addition to plagiarism detection, textual repetition has been studied extensively in the con- text of text alignment, collation and comparison [10, 40, 15], and text reuse [45, 38]. But there is remarkably little previous work focusing on the identi昀椀cation of formulaic expressions. We found several digital humanities studies regarding structure in text in general [1, 2, 31, 36, 45, 38, 39] and some more speci昀椀c studies for technical text [8] and legal arguments [32, 11, 12]. In all these cases, the object of study is text repetition and the use of isolated speci昀椀c terminology (noun phrases) rather than the use of formulaic phrases. To the best of our knowledge, outside 129 of linguistics, humanities scholars have not written much about formulaic language use. Prob- ably, only serial use of textual features make formulas useful for study. Scholars who have to read through them tend to see them as repetitive textual features without relevant information content. 2.3. Issues with variation-rich text One of the big challenges of text analysis on corpora of historical texts is that they are rich in spelling variation. Many historical languages had no standard spelling and changed in spelling over time. Moreover, many texts extracted from digitised documents contain text recognition errors. These issues together lead to possibly many di昀昀erent spellings of the same word or phrase. This challenge can to some extent be addressed by normalising the spelling. This maps spelling variants of words to a standard or ‘normal’ spelling of the word. VARD2 [5, 16] is a lexicon-based technique that was originally developed for historical English but ported to di昀昀erent historical languages. TICCL [34, 35] was developed originally to automatically nor- malise very large collections of 19th and 20th century Dutch. A di昀昀erent approach is to use fuzzy string matching and searching starting from a list of known phrases [22]. There are re- cent techniques based on deep neural networks, like PIE [23], that can be trained to lemmatise variation-rich languages, resulting in ‘normalised’ lemmas. However, this also reduces mor- phological variation that can be meaningful in distinguishing between expressions. Another drawback is that this requires a large amount of training material of linguistically annotated text. Since we are using the same corpus as [22], we took inspiration from their fuzzy searching approach, but since it requires knowing the formulas in advance and the technique becomes very slow when a large number of formulas is used for searching, we decided to use a simpli昀椀ed approach of detecting common word n-grams and using character n-gram indexing to 昀椀nd orthographically similar spellings. 3. Formulaic expressions and their use We narrow down our object of research with a more precise but still pre-theoretical de昀椀nition of formulaic expressions and their context. A literature search has not given us any de昀椀nition of formulaic expressions beyond the notions of lexical bundles and idiomatic expressions in common language use [7, 26]. Lexical bundles are o昀琀en noun phrases and examples of domain speci昀椀c terminology. We take a information theoretical perspective, and need a de昀椀nition that is applicable to di昀昀erent corpora and that helps to identify formulaic expressions from large amounts of text. Given the nature of the corpus of resolutions and the corpus-speci昀椀city of its formulaic expressions, we need a de昀椀nition that takes into account that formulas tend to be longer phrases, though not necessarily complete clausal units, that can incorporate and give context to variable elements like names of persons, organisations and locations or dates. 130 3.1. Characteristics of Formulaic Expressions We de昀椀ne formulaic expression as a multi-word phrase (an extended lexical unit) that is reused o昀琀en across documents in a collection, with minimal word variation, but with potentially high variation in spelling.1 They may contain variable elements, spans consisting of e.g. entity names or dates. In the resolutions, a phrase might express that a certain type of proposal was submitted by a person, whose name is a variable element in the formula. As far as we can tell, this de昀椀nition captures the formulas found by Rutten and Wal as well. What constitutes a formulaic expression might di昀昀er across domains, genres or corpora. We will discuss this further at the end of this paper in Section 6, but note here that this de昀椀nition does not yet give us proper criteria for deciding what is a formula and what is not. Therefore, our research design is exploratory, rather than descriptive or explanatory [4, pp.91-92]. It serves to give us a better understanding of the phenomenon of formulaic expressions and how we can develop methods to study them, rather than to o昀昀er a precise description of how they are used or an explanation of how they emerge or evolve. The study of formulaic language use has identi昀椀ed a number of reasons for speakers and authors to use formulaic expressions [42]. Within the domain of legal and administrative texts, the most relevant are precision of communicated information (e.g. “Cleared for takeo昀昀” signals permission to enter a runway and commence takeo昀昀), and signalling the structure of discourse (e.g. “on the other hand” signals an opposition) [13, p.46]. The formulaic expressions that are used over and over in many historical legal and admin- istrative documents, serve as precise referents to the information that the document is to com- municate. They prevent di昀昀erences in interpretation, but also structure the information of the document. For instance, in a corpus of proclamations, a standard opening phrase signals where each proclamation starts. Notarial deeds o昀琀en have template text to indicate the role of the actors involved in the deed and that the contract is a legally binding agreement between the actors. In this way, formulaic expressions let us detect this structure. 3.2. Textual repetition across domains and genres We compare the corpus of resolutions against a set of historical and modern text corpora from various domains, to get insight in how textually repetitive it is and whether that is related to the domain and genre of administrative and legal texts. We use the following document collections to study the amount of textual repetition (Table 1). There are 昀椀ve corpora consisting of mostly administrative documents: • The Resolutions corpus contains 286,871 printed resolutions of the SG in the period 1705- 1796, with a Character Error Rate (CER) of 3%. • The Notarial Deeds of the city archive of Amsterdam contain legal transactions. We expect deeds to have at least a few formulaic expressions stating the nature of the transaction, the parties involved, and that it has been signed o昀昀 by a notary. We estimate the CER to be in around 10%. 1 The latter aspect is part of the de昀椀nition to account for historical variations in spelling of what is semantically or pragmatically the same phrase. 131 • The collection of Dutch medieval charters contains manual transcriptions of handwritten charters, which we expect to contain mostly formal language with low CER. • The Mandate and Police books from the city state of Bern cover the same period (and more) and also contain administrative and legal documents where we expect formal language use, but in a di昀昀erent language (German) and with a higher rate of text recognition errors (CER of around 20%). The two types of books represent di昀昀erent administrative sub- genres so we treat them as separate corpora. We compare these against 昀椀ve corpora with mostly free-form text from various domains, in- cluding two with historic Dutch language and three corpora with modern Dutch: • The general missives of the Verenigde Oostindische Compagnie (VOC, Dutch East India Company) consist of long business correspondences between the o昀케ces of the VOC in Asia and Amsterdam, which we expect are less formal and more free-form than the reso- lutions. These are handwritten documents, for which we could not obtain accurate CER information, but we estimate it to be in the range of 15-20%. • The Dutch Newspaper corpus of the National Library of the Netherlands contains over 700,000 articles from dozens of Dutch language newspapers in the 18th century. This corpus was OCR’ed around 2006 and has a CER of 15-20%. The articles are free-form and cover many topics, so we expect low repetition. • Dutch Wikipedia consisting of articles that are in principle free-form, but occasionally, entire databases of e.g. sports clubs, television shows or plant species are algorithmically turned into a set of article stubs using a template article. Although later manual edits of a template-based article tend to transform the template text to more free-form prose, some template phrasings may remain. • Dutch novels, a set of 10,921 recently published Dutch novels (text extracted from epubs). We expect these to be free-form with little repetition. • Book reviews is a set of 472,810 online Dutch book reviews [9, 21] from seven di昀昀erent reviewing platforms. Reviews are also free-form, although book reviews represent a very narrow domain with potentially many stock phrases, so we expect some form of repetition. A Vocabulary Growth Curve [3, 6] shows how much the frequency of vocabulary terms grows with respect to the fraction of terms in a collection that have been seen only once. By iterating over all paragraphs in a corpus in a random order, we count the total number of terms �㕁 seen so far (i.e. term tokens) and at 昀椀xed points—e.g. once every 1000 words—divide the size of the vocabulary �㕉 (�㕁 ) over the number of hapax legomena �㕉1 (�㕁 ), e.g. terms that have occurred only once so far. A higher value of �㕉�㕉 (�㕁 ) means more of the term frequency mass is taken by 1 (�㕁 ) terms that occur more than once. The vocabulary growth curves are shown in Figure 2 for frequencies of word n-grams with Ā ∈ [1, 3, 5]. The administrative corpora are shown with dashed lines, while the more free-form corpora are shown with dotted lines. Further, corpora with high CER are shown with dense lines (little horizontal space between the symbols), and the rest with more widely separated symbols. 132 Figure 2: Vocabulary Growth curves of word n-grams for seven corpora of Dutch text. The top plot shows word 1-grams, the middle shows word 3-grams and the bottom shows word 5-grams. The corpora we assume to have formal language are represented by dashed lines, the free-form ones by dotted lines. Corpora with high Character Error Rate (CER) are represented by narrowly separated symbols, those with low CER by widely separated symbols. 133 Table 1 Overview of document collections used to study textual repetition Collection name Domain Period # docs # words Historic Resolutions of the Dutch States General Administrative 1705-1796 286,871 58,430,762 Dutch East India Company missives Administrative 1637-1792 981,457 218,923,640 Dutch Notarial deeds (Amsterdam) Legal 1612-1833 93,262 29,298,606 Dutch medieval charters Administrative 1299-1345 3,522 639,804 Dutch newspapers News 1700-1799 705837 381,655,444 Bern region Mandate and Police books Administrative 1458-1798 21,820 6,257,187 Modern Wikipedia NL Reference 2005-2022 2,881,669 319,609,408 Dutch Novels Fiction 2010-2020 10,921 745,977,872 Online Dutch book reviews Reviews 1999-2020 472,810 57,970,421 For single words, the modern Dutch corpora have higher curves, meaning they have rela- tively more terms that occur more frequently, than the historic corpora based on algorithmic text recognition. This is not surprising, given the spelling variation and recognition errors in historic corpora. Both phenomena increase the size of the vocabulary and thereby lead to more mass at to the hapax legomena. Novels and reviews have the highest curves. We speculate that novels tend to use mostly common vocabulary to be easy to read by a large audience, while book reviews are a speci昀椀c domain and genre, so use a relatively narrow vocabulary. Wikipedia contains articles about a huge range of topics, so it is understandable that it has a longer tail of hapax legomena. The medieval charters use a very limited vocabulary and have a lot of term repetition. As it is based on manual transcription, we assume the rate of recognition errors to be much lower than for the corpora based on OCR or HTR. The low/high CER distinction corresponds to a clear di昀昀erence, with all low CER corpora having much higher curves. For word 3-grams, the resolutions and Bern police and mandate books have curves that fall o昀昀 very little, signalling that, although they have many single term hapax legomena, they have relatively many word 3-grams that occur more than once, compared to the other corpora. The curve for online book reviews overtakes the resolutions curve a昀琀er around 500,000 tokens, sug- gesting indeed that additional reviews introduce relatively few new 3-grams and that reviews are therefore relatively similar to each other. For word 5-grams, the top curves are those of the charters, Bern police and mandate books, the resolutions, notarial deeds and the book reviews. With the exception of the latter, they are all in the domains of legal and administrative texts. These curves support our intuition that texts in the legal domain have relatively many re- peated phrases, despite the spelling variation and character recognition mistakes. 4. Modelling Formulaic Expressions How frequent should a phrase be to be considered a formulaic expression? There can be for- mulaic expressions that are borrowed from other domains or genres, that are used with low frequency in the collection in which formulaic expressions are analysed. We leave these out of 134 the scope of this paper, as we want to focus on expressions that are frequent enough that they can be used as metadata that cover most of the resolutions. Identifying borrowed formulas requires knowledge or analysis of external resources. We want to 昀椀nd word sequences that occur frequently. The simplest way would be to count the frequencies of word n-grams for some range of Ā, similar to the word n-gram analysis of Section 3.2. However, the number of n-gram types grows rapidly as Ā increases, and the vast majority of these occur only once or twice. The corpus of resolutions has 735,919 distinct words, so the number of word 1-gram types is the same, but for Ā = 4, the number of n-gram types is 23,985,191. We exploit the fact that frequently occurring phrases can only consist of words that individually occur at least as frequently as the phrases themselves. That is, phrases that occur 100 times in the collection must consist of words that occur at least 100 times. To 昀椀nd phrases that occur at least ĂℎĄ�㕎ą�㕒�㕓 Ą�㕒ăÿ�㕖Ā = 100, we can exclude all candidate phrases that contain words with a corpus frequency below this phrase frequency threshold. Furthermore, the words within a phrase also co-occur with each other at least 100 times within a window that is equal to the word length of the phrase. With these observations in mind, we developed a naive algorithm for identifying candidate formulaic expressions. In the pre-processing phrase, candidate phrases of 昀椀xed length are de- tected, a昀琀er which their contexts of preceding and following words are clustered and analysed to extend the partial formulas and identify their start and end boundaries. The parameterisa- tion we arrive at is ad hoc and speci昀椀c this the corpus of resolutions, since we have no precise de昀椀nition yet of what makes a phrase formulaic in a particular context. The goal is to explore the corpus with a pre-theoretical notion of what we are looking for. Concretely, the pre-processing phase consist of the following steps: 1. Tokenise each resolution into sentences and sentences into words. 2. Iterate over the corpus and count frequencies of individual words 3. Iterate over the corpus a second time, and replace each word with a variable token if it either has 1) a term frequency Ć�㕒Ąÿ�㕓 Ą�㕒ă(�㕤�㕖 ) < ĂℎĄ�㕎ą�㕒�㕓 Ą�㕒ăÿ�㕖Ā , or 2) a co-occurrence frequency �㕐āā�㕐�㕓 Ą�㕒ă(�㕤�㕖 , �㕤�㕗 ) < ĂℎĄ�㕎ą�㕒�㕓 Ą�㕒ăÿ�㕖Ā with at least one of its remaining neighbour- ing �㕁 = 5 words �㕤�㕗 ∈ (�㕤�㕖−�㕁 , �㕤�㕖+�㕁 ) on either side. 4. Slide a 5-word window over each individual sentence, extract the 5-word window as a phrase if it contains no variable token, and count the frequency of each extracted phrase. The process is demonstrated in detail in Appendix A. In the extension phase, we reduce the set of candidate phrases to a set of formulas in two steps. First, we use fuzzy string matching to 昀椀nd clusters of candidate phrases that are spelling variations of each other. In the second step, we gather the contexts around each occurrence of a cluster of phrases, and count how o昀琀en the phrases are preceded and followed by the same sequence of words. 4.1. Clustering phrase spelling variants Many of the common word 5-grams are spelling variants of each other. We cluster them by indexing these word 5-grams as vectors of character 1-skip-2-grams. That is, we consider not only 2 adjacent characters, but also pairs of characters that are separated by another character. 135 Starting from the most frequent word 5-gram phrases, we query the index to 昀椀nd candidate variants using cosine similarity. Further details are provided in Appendix B. 4.2. Extending partial formulas Next, we build frequency lists of the 8 words preceding and following the 昀椀xed length phrase and use transition probabilities to identify extensions that have a probability close to 1 of pre- ceding or following the phrase. This is inspired by probabilistic language models based on Hidden Markov Models [30, 17]. We note that there might be formulas shorter than 5 words. These can be detected by starting with shorter 昀椀xed length phrases. Because the preceding (pre昀椀x) and following (post昀椀x) contexts can include clusters of spelling variants as well, we use the same fuzzy matching algorithm as used for clustering the phrases. We then split all 8-word contexts into sequences of words and calculate transition probabilities for pre昀椀x and post昀椀x contexts separately, starting from the 昀椀xed length phrase to the word immediately preceding or following it, and from that word to the next word, etc. Words that occur in multiple pre昀椀x or post昀椀x contexts thereby have a higher transition proba- bility. Words that have a probability below 0.1 are considered to be not part of the formulaic expression and are replaced by a token. Once all transition probabilities have been com- puted, we traverse the transition model starting from the 昀椀xed length phrase and consider preceding words part of the formulaic expression if the probability is above 0.9. Once the prob- ability drops below 0.9 but is still above 0.1, we assume to have reached a common context of the formula that is not part of the formula itself. We repeat the same process for the post-phrase context, again, computing transition probabilities starting from the 昀椀xed phrase. This process is described in more detail in Appendix B. 5. Results We start with the 23,141 candidate phrases that we got from using a minimum frequency thresh- old of 100. Clustering variant phrases reduces this to 12,880 clusters. The most frequent phrase is considered the representative variant. Extending these phrases with preceding and following words with a transition probability above 0.9 results in 11,497 candidate formulas and 52,348 common extensions (with cumulative transition probabilities 0.1 ≤ ĂĆĄ�㕎Āą < 0.9). The 10 most frequent phrases are shown in Table 2, together with their corpus frequencies and the formulas that were derived from analysing their contexts. The phrase ‘ont- fangen een missive van’ is the most frequent 昀椀xed-length phrase that is also the most frequent formula (there is no extension that has a transition probability above 0.9). A less frequent and partially overlapping phrase is ‘ontfangen een missive van den’, which in the extension step is extended to ‘ ontfangen een missive van den’, that is, the ‘ ’ token is added to it. The 昀椀rst two formulas therefore also partially overlap, but the second is an extension of the 昀椀rst. The word ‘den’ (EN: the) is the most common continuation of the 昀椀rst formula, but other common continuations are names of persons, so the second formula is less ‘formulaic’ than the 昀椀rst. Phrases 3 and 4 lead to the exact same formula, as do phrases 5, 7, 8 and 9. Phrase 10 is also a partial overlap with these phrases, but because it includes the word ‘dat’ (EN: that) 136 Table 2 The 10 most frequent word 5-gram phrases, their frequency and their associated formulas that were identified based on their contexts. Rank Initial phrase Freq. Formula 1 ontfangen een missive van 138682 ontfangen een missive van 2 ontfangen een missive van den 107679 ontfangen een missive van den 3 geen resolutie is gevallen 90265 waar op geen resolutie is gevallen 4 op geen resolutie is gevallen 89528 waar op geen resolutie is gevallen 5 waar op gedelibereert zynde is 86621 waar op gedelibereert zynde is goedgevonden en verstaan 6 waar op geen resolutie is 68153 waar op geen resolutie is gevallen 7 op gedelibereert zynde is goedgevon- 67601 waar op gedelibereert zynde is den goedgevonden en verstaan 8 zynde is goedgevonden en verstaan 66061 waar op gedelibereert zynde is goedgevonden en verstaan 9 gedelibereert zynde is goedgevonden 64207 waar op gedelibereert zynde is en goedgevonden en verstaan 10 is goedgevonden en verstaan dat 55974 zynde is goedgevonden en verstaan dat at the end — which is again a common follow up, but not the only one — it is not extended to the same formula. We can further cluster these formulas by re-categorising the less frequent formulas that are extended or reduced versions of more frequent formulas as common extensions. If we perform this re-categorisation, the list of 11,497 formulas is reduced 7,153 formulas. Eyeballing the list of remaining formulas reveals there are still many spelling variants in the list. This shows that spelling variant is a challenging problem that needs to be analysed in more detail. 5.1. Analysis of formulas Almost 58% of the candidate formulas are longer than 5 words. Of course, our choice to start from candidate phrases of 5 words ensures no formulas shorter than 5 words are found, but the fact that more than half are extended shows that formulaic expressions in the resolutions are long syntactic units consisting of more than compound nouns and noun phrases. The 10 formulas resulting from clustering the most frequent formulas are shown in Table 3. The last column describes how we can use these formulas to identify meaningful elements in the running text, and how they help in classifying these elements. It is worth noticing that many formulas precede of follow a named entity, suggesting that formulas were frequently used to assert the relationship of that entity to the proposition or decision. Several formulas also contain verbs (has been read, has been agreed and understood, to report back) that signal speci昀椀c actions. In future work, we will analyse the types of relationships and actions that these formulas express. 137 Table 3 The top 10 most frequent formulas and how they structure and identify elements of the text. Rank Formula Translation Signal 1 ontfangen een missive received a missive of this is the start of a resolution van and the start of the proposal paragraph, the proposal docu- ment type is missive 2 waar op geen resolutie is on which no resolution was this is the end of the resolution, gevallen made no decision was made 3 waar op gedelibereert zynde is which, upon deliberation, has start of the decision paragraph goetgevonden en verstaan been agreed and understood 4 en van alles alhier ter ver- and to report back on every- decision to start a investigative gaderinge rapport te doen thing, here in the meeting committee that will report back at a later date, signal that there is a future resolution related to this one 5 en andere haar hoogh mogende and other high and mighty name preceding this formula is gedeputeerden tot de deputies of a deputy, name following this formula is an institution, the deputy is a representative of the institution 6 de heeren gedeputeerden van the gentlemen deputies of the the name following this for- de mula is a province or institution 7 in handen van de heeren in the hands of the gentlemen decision that the matter is handed to a committee to in- vestigate, the name(s) following this formula are the members of this committee 8 haar hoogh mogende resolutie resolution of her high and what follows is a date of a previ- van den mighty of the ous resolution, the previous res- olution is related to this resolu- tion 9 aan het hof van sijne at the court of his the name preceding this for- mula is a representative of the court of the name following this formula 10 Is ter vergaderinge gelesen de Has been read in this meeting, this is the start of a resolution requeste van the petition of and the start of the proposal paragraph, the proposal docu- ment type is missive Applying the approach to other corpora is elaborated in Appendix E. 138 5.2. Challenges of Evaluation The detection approach above is exploratory, as we have no precise de昀椀nition of what a for- mulaic expression is. We need clear criteria to determine if a phrase is formulaic before we can precisely de昀椀ne the task of formula detection and quantitatively evaluate methods designed to perform that task. For a proper evaluation of the detection method the entire corpus needs to be annotated with all formulas. We could reduce the problem by focusing on a small sample of resolutions and annotate anything that we think is a formula, but we would still need clear criteria to decide what is a formula and what is not. Another alternative is to use simulation and generate text and insert arti昀椀cially generated formulaic expressions to fully control the characteristics of formula, including their length, variation and frequency of occurrence. Intuitively, a quantitative evaluation should consider precision and recall of detecting for- mulaic expressions, with two di昀昀erent measures of recall. One is the fraction of di昀昀erent types of formulaic expressions that are identi昀椀ed (type-based recall), and the other is the fraction of occurrences of formulaic expressions that are detected (token-based recall). 6. Discussion and Conclusions Formulas and their use have relevance to a number of both methodological and humanities research questions. Because formulaic expressions and their use for text structuring are under- studied, for a large part we can only raise these research questions. Our 昀椀ndings of the use of repetitive phrases in various corpora suggests that repetitive lan- guage use di昀昀ers strongly across domains and genres, with texts in administrative domains containing more repetitive phrasing. It is not clear whether this is true for all or just a spe- ci昀椀c type of administrative texts. Our 昀椀ndings with detecting formulas suggest that this type of serial government sources may contain many formulaic expressions that can be exploited to structure texts and extract information. Further research has to shed light on the extent to which this also holds for other administrative sources and how the composition and diversity of collections relates to the statistical properties of formulas. A second type of research questions centres on the use of formulas. We do not know whether comparable administrative archives from di昀昀erent periods have the same rate of repetitive phrases and formulas. We observed that formulas emerged, changed and disappeared over time, but not at what rate and what the causes were. We do not know if there was an increase in adoption of formulaic expressions in the resolutions. There could have been changes in legal or administrative customs, more gen- eral language and cultural changes or perhaps even in昀氀uences of speci昀椀c scribes. The switch to the use of printing would lead us to assume that there was less variation in the formulas used, but it is too early to test this assumption. It is also an open question where the formulas originate, and whether formulas are reused across (administrative) domains. A formula bor- rowed from another domain might not be used o昀琀en in a corpus, in which case the method and de昀椀nition we developed in this paper do not su昀케ce. So further discussion is needed on what constitutes a formulaic expression, and what its generic and context-speci昀椀c elements are. As for the content of formulas, we can only make some remarks about those that occur in the resolutions. We have been able to spot a great number of formulas but it is hard to give a precise de昀椀nition of what a formula is, or to establish if we can discern constituent elements 139 and if it makes sense to divide up formulas in these constituent parts. In this paper we describe a methodology in development, that iteratively gathers formulas from our corpus. This works well for identifying the most frequently used formulas and their variations. Further steps have to make clear what the optimal rate of formula detection is, how to categorise the di昀昀erent formulas, and if they are su昀케cient, to identify and localise di昀昀erent logical elements in the text and whether this works for all resolutions. A further point of discussion is how much of the methodology can be used for other corpora. We believe that the general methodology is suitable for comparable text corpora. 7. Acknowledgments This work is part of the REPUBLIC project (2019-2024), a Research Infrastructure project funded by the Dutch Research Council (NWO, Grant number 175.217.024). References [1] G. Altmann and R. Köhler. “Forms and Degrees of Repetition in Texts”. In: Forms and Degrees of Repetition in Texts. De Gruyter Mouton, 2015. [2] P. Auslander. “On Repetition”. In: Performance Research 23.4-5 (2018), pp. 88–90. [3] H. Baayen. “The e昀昀ects of lexical specialization on the growth curve of the vocabulary”. In: Computational Linguistics 22.4 (1996), pp. 455–480. [4] E. R. Babbie. The practice of social research. Cengage learning, 2020. [5] A. Baron and P. Rayson. “VARD2: A tool for dealing with spelling variation in historical corpora”. In: Postgraduate conference in corpus linguistics. 2008. [6] M. Baroni and S. Evert. “The zipfR package for lexical statistics: A tutorial introduction”. In: Available atzipfr. r-forge. r-project. org/materials/zipfrtutorial. pdf [last accessed1 June 2019] (2014). [7] D. Biber, S. Johansson, G. Leech, S. Conrad, E. Finegan, and R. Quirk. Longman grammar of spoken and written English. Vol. 2. Longman London, 1999. [8] B. Boguraev and C. Kennedy. “Technical Terminology for Domain Speci昀椀cation and Con- tent Characterisation”. In: Scie. 1997. doi: 10.1007/3-540-63438-x\_5. [9] P. Boot. “A Database of Online Book Response and the Nature of the Literary Thriller.” In: Dh. 2017. [10] R. L. Cannon. “OPCOL: An Optimal Text Collation Algorithm”. In: Computers and the Humanities (1976), pp. 33–40. [11] D. S. Carvalho, M.-T. Nguyen, C.-X. Tran, and M.-L. Nguyen. “Lexical-Morphological Modeling for Legal Text Analysis”. In: JSAI International Symposium on Arti昀椀cial Intelli- gence. Springer, 2015, pp. 295–311. 140 [12] D. S. Carvalho, V. D. Tran, K. Van Tran, V. D. Lai, and M.-L. Nguyen. “Lexical to Discourse- Level Corpus Modeling for Legal Question Answering”. In: Tenth International Workshop on Juris-Informatics (JURISIN). 2016. [13] K. Conklin and N. Schmitt. “The processing of formulaic language”. In: Annual Review of Applied Linguistics 32 (2012), pp. 45–61. [14] A. P. Cowie. Phraseology: Theory, analysis, and applications. OUP Oxford, 1998. [15] R. Haentjens Dekker, D. Van Hulle, G. Middell, V. Neyt, and J. Van Zundert. “Computer- supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project”. In: Digital Scholarship in the Humanities 30.3 (2015), pp. 452–470. [16] I. Hendrickx and R. Marquilhas. “From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation.” In: J. Lang. Technol. Comput. Linguistics 26.2 (2011), pp. 65– 76. [17] F. Jelinek. Statistical methods for speech recognition. MIT press, 1998. [18] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. “Bag of tricks for e昀케cient text clas- si昀椀cation”. In: arXiv preprint arXiv:1607.01759 (2016). [19] D. Jurafsky and J. H. Martin. “Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing”. In: Up- per Saddle River, NJ: Prentice Hall (2008). [20] F. Karsdorp. “Het is groen en lee昀琀 nog lang en gelukkig. Classi昀椀catie van volksverhaal- genres op basis van formules”. In: Tijdschri昀琀 voor Nederlandse Taal-en Letterkunde 129.4 (2014), pp. 274–288. [21] M. Koolen, P. Boot, and J. J. van Zundert. “Online Book Reviews and the Computational Modelling of Reading Impact”. In: Proceedings of the Workshop on Computational Human- ities Research (CHR 2020). Vol. 2723. 2020, p. 0073. url: http://ceur-ws.org/Vol-2723/lon g13.pdf. [22] M. Koolen, R. Hoekstra, I. Nijenhuis, R. Sluijter, E. van Gelder, R. van Koert, G. Brouwer, and H. Brugman. “Modelling Resolutions of the Dutch States General for Digital Histor- ical Research.” In: Colco. 2020, pp. 37–50. [23] E. Manjavacas, Á. Kádár, and M. Kestemont. “Improving Lemmatization of Non-Standard Languages with Joint Learning”. In: Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Compu- tational Linguistics, 2019, pp. 1493–1503. doi: 10.18653/v1/N19-1153. url: https://www .aclweb.org/anthology/N19-1153. [24] V. T. Martins, D. Fonte, P. R. Henriques, and D. d. Cruz. “Plagiarism detection: A tool survey and comparison”. In: (2014). [25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. “Distributed representations of words and phrases and their compositionality”. In: Advances in neural information processing systems 26 (2013). 141 [26] M. Poß and T. v. d. Wouden. “Extended lexical units in Dutch”. In: LOT Occasional Series 4 (2005), pp. 187–202. [27] M. Potthast, A. Eiselt, L. A. Barrón Cedeño, B. Stein, and P. Rosso. “Overview of the 3rd international competition on plagiarism detection”. In: CEUR workshop proceedings. Vol. 1177. CEUR Workshop Proceedings. 2011. [28] M. Potthast, M. Hagen, T. Gollub, M. Tippmann, J. Kiesel, P. Rosso, E. Stamatatos, and B. Stein. “Overview of the 5th international competition on plagiarism detection”. In: CLEF Conference on Multilingual and Multimodal Information Access Evaluation. Celct. 2013, pp. 301–331. [29] M. Potthast, B. Stein, E. Andreas, and A. B.-C. P. Rosso. “Overview of the 1st international competition on plagiarism detection”. In: 3rd PAN Workshop. Uncovering Plagiarism, Au- thorship and Social So昀琀ware Misuse. 2009, p. 1. [30] L. R. Rabiner. “A tutorial on hidden Markov models and selected applications in speech recognition”. In: Proceedings of the IEEE 77.2 (1989), pp. 257–286. [31] G. E. Raney. “A Context-Dependent Representation Model for Explaining Text Repetition E昀昀ects”. In: Psychonomic Bulletin & Review 10.1 (2003), pp. 15–28. [32] R. Rashidi-Tabrizi, G. Mussbacher, and D. Amyot. “Legal Requirements Analysis and Modeling with the Measured Compliance Pro昀椀le for the Goal-Oriented Requirement Language”. In: 2013 6th International Workshop on Requirements Engineering and Law (RELAW). Ieee, 2013, pp. 53–56. [33] R. Rehurek and P. Sojka. “Gensim–python framework for vector space modelling”. In: NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic 3.2 (2011), p. 2. [34] M. Reynaert. “Non-interactive OCR post-correction for giga-scale digitization projects”. In: International Conference on Intelligent Text Processing and Computational Linguistics. Springer. 2008, pp. 617–630. [35] M. Reynaert. “TICCLops: Text-Induced Corpus Clean-up as online processing system”. In: Proceedings of coling 2014, the 25th international conference on computational linguis- tics: System demonstrations. 2014, pp. 52–56. [36] S. Ruecker, M. Radzikowska, P. Michura, C. Fiorentino, and T. Clement. “Visualizing Repetition in Text”. In: Digital Studies/Le champ numérique 1.3 (2009). [37] G. Rutten and M. J. van der Wal. “Functions of epistolary formulae in Dutch letters from the seventeenth and eighteenth centuries”. In: Journal of Historical Pragmatics 13.2 (2012), pp. 173–201. [38] H. Salmi, P. Paju, H. Rantala, A. Nivala, A. Vesanto, and F. Ginter. “The reuse of texts in Finnish newspapers and journals, 1771–1920: A digital humanities perspective”. In: Historical Methods: A Journal of Quantitative and Interdisciplinary History 54.1 (2020), pp. 14–28. [39] M. A. Samkova. “Repetition and Intertextuality as Modalities of Text Structuring and perception”. In: Facta Universitatis, Series: Linguistics and Literature 0 (2016), pp. 95–105. 142 [40] D. Schmidt and R. Colomb. “A data structure for representing multi-version texts online”. In: International Journal of Human-Computer Studies 67.6 (2009), pp. 497–514. [41] N. Schmitt. Formulaic sequences: Acquisition, processing, and use. Vol. 9. John Benjamins Publishing, 2004. [42] N. Schmitt and R. Carter. “Formulaic sequences in action”. In: Formulaic sequences: Ac- quisition, processing and use (2004), pp. 1–22. [43] E. Stamatatos. “Intrinsic plagiarism detection using character n-gram pro昀椀les”. In: thresh- old 2.1,500 (2009). [44] A. Vesanto, F. Ginter, H. Salmi, A. Nivala, and T. Salakoski. “A system for identifying and exploring text repetition in large historical document corpora”. In: Proceedings of the 21st Nordic Conference on Computational Linguistics. 2017, pp. 330–333. [45] A. Vesanto, A. Nivala, H. Rantala, T. Salakoski, H. Salmi, and F. Ginter. “Applying BLAST to text reuse detection in 昀椀nnish newspapers and journals, 1771-1910”. In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language. 2017, pp. 54–58. [46] D. Wood. “Uses and functions of formulaic sequences in second language speech: An exploration of the foundations of 昀氀uency”. In: Canadian Modern Language Review 63.1 (2006), pp. 13–33. [47] A. Wray. Formulaic language and the lexicon. Eric, 2002. 143 8. Appendix A. Identifying candidate partial formulas We demonstrate the processing steps to identify candidate partial formulas using the following example sentences: Ontfangen een Missive van het Collegie ter Admiraliteyt in Zeelandt, geschreven te Middelburgh den negentienden deser loopende maandt, houdende, in gevolge en tot voldoeninge van haar Hoogh Mogende Resolutie van den vyfden der voorlede maandt, der zelver advis op het verzoeck van Burgermeesters en Scheepenen van het hooge en laage Zas van Gent. With word tokenisation, we add and tokens so that for words at the start or end of a sentence, this boundary is included in its context. Certain phrases only appear at or near the start or end of a sentence, and adding boundary tokens allows us to keep track of such cases. A昀琀er 昀椀ltering out low frequency words and words that do not meet the co-occurrence fre- quency threshold, we end up with the following list: [‘ ’, ’ontfangen’, ’een’, ’missive’, ’van’, ’het’, ’collegie’, ’ter’, ’admiraliteyt’, ’in’, ‘’, ’geschreven’, ’te’, ‘’, ’den’, ‘’, ’deser’, ’loopende’, ’maandt’, ’houdende’, ’in’, ‘gevolge’, ’en’, ’tot’, ’voldoeninge’, ’van’, ’haar’, ’hoogh’, ’mogende’, ’resolutie’, ’van’, ’den’, ‘’, ’der’, ’voorlede’, ’maandt’, ’der’, ’zelver’, ’advis’, ’op’, ’het’, ‘’, ’van’, ‘’, ’en’, ‘’, ’van’, ’het’, ‘’, ’en’, ‘’, ‘’, ’van’, ‘’, ‘ ’] The example above shows that most of the words in the sentence occur frequently and co- occur with each other frequently. This results in a large number of candidate phrases. In the resulting sentence, we extract candidate phrases Ă(�㕤�㕖 , �㕤�㕖+5 by identifying sequences of word tokens (i.e. non-variable tokens) of length |Ă| = 5 and count their frequency over the entire corpus. Table 4 The impact of the minimum term and co-occurrence frequency thresholds on the vocabulary and num- ber of co-occurring word pairs and number of phrases. Min. Freq. Total 5-word phrases Phrases above Threshold Vocab. size Co-oc pairs Types Tokens Threshold 1 614,829 27,345,551 30,261,003 57,027,160 30,261,003 10 71,151 19,260,903 24,270,577 50,687,254 282,486 100 17,021 12,025,517 16,694,381 41,427,641 23,141 1000 3,547 3,993,786 6,839,379 26,668,452 2,119 10,000 605 316,686 756,629 10,435,346 140 144 As little is known in advance about the relationship between frequencies of words, word co-occurrence and formulas, we have few meaningful clues for setting a minimum phrase fre- quency. The corpus of resolutions has 286,871 resolutions and over 58 million words, but we have no reliable estimates of the total number of formulaic expressions that are used and how o昀琀en they occur. In identifying the start of proposition paragraphs of resolutions, [22] use a list of 32 formulas, most of which are between 5 and 10 words long. But we do not know if these are all the formulas used for openings of proposition and decision paragraph. We also do not know which other elements of a resolution are expressed in formulas. To get some insight, we experimented with minimum frequencies of di昀昀erent orders of mag- nitude, ranging from 1 to 10,000. The results are shown in Table 4, with for each frequency threshold, the size of the vocabulary (number of distinct words), the number of distinct co- occurring word-pairs within a 5 word window, the number of distinct 5-word phrases (phrase types) a昀琀er pre-processing, the total number of phrases (phrase tokens) and the number of phrases that meet the frequency threshold. Frequency thresholds of 1 and 10 lead to large vocabularies and huge numbers of phrases. If they are to be made useful as metadata labels or boundary signals, we need to analyse each of them in their context manually, so these thresholds result in an unmanageable number of phrases. On the other extreme, a threshold of 10,000 leads to a very small vocabulary of 605 highly common words and only 140 highly frequent phrases. The most common phrase is Ontfangen een Missive van’ (EN: Received a missive of), which is also the start of the example sentence above, with a frequency of 138,682. Note that there are variant spellings of this phrase that are not included in the count. With such a high frequency, it is clear that this phrase is part of a 昀椀xed formula, but because we made 昀椀xed-length phrases, it is not clear whether this the entire formula. It is quite likely that some formulas contain words with a frequency below 10,000, and be- cause we know so little yet about the usage of formulas, it is better to be conservative and choose a lower threshold. For the rest of this paper, we pick a minimum frequency threshold ĂℎĄ�㕎ą�㕒�㕓 Ą�㕒ăÿ�㕖Ā = 100. This gives us a large set of 23,141 frequent phrases that are candidate formulaic expression or part of them. B. From candidate phrases to formulaic expressions The process of transforming the set of candidate phrases to a set of formulas has two main steps. First, we use fuzzy string matching to 昀椀nd clusters of candidate phrases that are spelling variations of each other. In the second step, we gather the contexts around each occurrence of a cluster of phrases, and count how o昀琀en the phrases are preceded and followed by the same sequence of words. B.1. Clustering variant phrases We index each word 5-gram phrase as a vector of character 1-skip-2-grams. That is, we con- sider not only 2 adjacent characters, but also pairs of characters that are separated by another character. The reason to include a single skip is that spelling variants o昀琀en have di昀昀erences in 145 characters in the middle of a word, which results in multiple ngram mismatches and thus few matches when no skips are used. Starting from the most frequent word 5-gram phrases, we query the index to 昀椀nd candidate variants using cosine similarity. Further details are provided in Appendix B. We limit the can- didate set to phrases that di昀昀er in length with the query phrase by at most 2 characters, based on the assumption that much longer or shorter phrases are unlikely to be spelling variants. We 昀椀lter the candidate phrases by checking that the words in each position 1..5 of the candidate phrase di昀昀er in length no more than 2 characters with their aligned words in the query phrase. This avoids matching a query phrase ‘op gedelibereert zynde is goetgevonden’ with its partial overlap ‘gedelibereert zynde is goetgevonden en’. The latter is typically the next 5-word phrase following the former, but because the query phrase has a short 昀椀rst word and the candidate phrase has a short last word, their ngram similarity is high. Using the length restriction on aligned words 昀椀lters out such erroneous matches. B.2. Extending candidate formulas In the extension step, we build frequency lists of the 8 words preceding and following the 昀椀xed length phrase and use transition probabilities to identify extensions that have a probability close to 1 of preceding or following the phrase. This is a similar approach to probabilistic language modelling [19], where the probably that a word �㕤�㕖 is followed by word �㕤�㕗 , is calculated as: �㕃(�㕤�㕖 , �㕤�㕗 ) �㕃(�㕤�㕗 |�㕤�㕖 ) = (1) �㕃(�㕤�㕖 ) This models the prediction of the next word only the current word. A natural extension is to model the prediction based on all words preceding it: �㕃(�㕤1 , �㕤2 , ..., �㕤�㕖 , �㕤�㕗 ) �㕃(�㕤�㕗 |�㕤1 , �㕤2 , ..., �㕤�㕖 ) = (2) �㕃(�㕤1 , �㕤2 , ..., �㕤�㕖 ) For extending phrases, we use this model, where the probability �㕃(ĂℎĄ) of a phrase ĂℎĄ consisting of words �㕤�㕖 , ..., �㕤�㕗 is �㕃(�㕤�㕖 , ..., �㕤�㕗 ). The transition probability from a phrase ĂℎĄ to a �㕃(ĂℎĄ,�㕤 ) �㕃(�㕤 ,ĂℎĄ) word �㕤�㕖 is �㕃(ĂℎĄ)�㕖 and from a word �㕤�㕖 to ĂℎĄ is �㕃(ĂℎĄ) �㕖 . We split all 8-word contexts into sequences of words and calculate transition probabilities for pre昀椀x and post昀椀x contexts separately, starting from the 昀椀xed length phrase to the word immediately preceding or following it, and from that word to the next word, etc. Words that occur in multiple pre昀椀x or post昀椀x contexts thereby have a higher transition probability. Words that have a probability below 0.1 are considered to be not part of the formulaic expression and are replaced by a token. Once all transition probabilities have been computed, we traverse the transition model starting from the 昀椀xed length phrase and consider preceding words part of the formulaic expression if the probability is above 0.9. Once the probability drops below 0.9 but is still above 0.1, we assume to have reached a common context of the formula that is not part of the formula itself. We repeat the same process for the post-phrase context, again, computing transition probabilities starting from the 昀椀xed phrase. 146 We extend the 昀椀xed length phrase with up to 8 words preceding it and up to 8 words follow- ing it, which means we can identify formulas of 8 + 5 + 8 = 21 words in a single pass. If the full 8-word path preceding or following the phrase has a cumulative transition probability close to 1, we repeat this extension process with the extended phrase to identify the boundary of the formula. An example of extending the phrase ’gevolge en tot voldoeninge van’ using transition prob- abilities is shown in Figure 3. On the le昀琀 side, the pre昀椀x context is shown, with the word ’in’ being the only word directly preceding the partial phrase. This means that ’gevolge en tot voldoeninge van’ is not the full formulaic expression, but that ’in’ is also part of it. The expres- sion ’in gevolge en tot voldoeninge van’ is also a syntactically more comprehensible phrase, meaning in consequence and ful昀椀lment of. There are multiple possible words preceding it. Two common ones are the verbs ’hebbende’ (EN: having), which precedes the phrase ’in gevolge en tot voldoeninge van’ in 38% of the occurrences of the phrase, and ’houdende’ (EN: main- taining) which precedes the phrase in 47% of its occurrences. The remaining occurrences of the phrase are preceded by a variety of other words. Each of these two verbs have multiple possible pre昀椀xes, but are themselves not part of the formula according to the de昀椀nition above, because their transition cumulative transition probabilities (1.0∗0.47 = 0.47 and 1.0∗0.38 = 0.38 respectively) do not meet the threshold of 0.9. On the right side, the post昀椀x context is shown, with two variants of continuations, neither of which is part of the formulaic expression itself. One common continuation is ’der selver resolutie’ and the other is ’haar hoog mogende resolutie’. Note that although neither path to the word ’resolutie’ has itself a cumulative transition probability above 0.9 (0.39 ∗ 0.99 ∗ 0.99 = 0.38 for the former and 0.59 ∗ 0.99 ∗ 0.99 ∗ 0.96 = 0.56 for the latter), their combined probabilities add up to 0.94. This meets the 0.9 probability threshold, thereby being an example of a formulaic expression that can have a variable middle part. In some cases that variable part is ’der selver’ and in others it is ’haar hoogh mogende’. The 昀椀nal formula is determined by extending the 昀椀xed length phrase with preceding and following words that have a cumulative probability of at least 0.9. In the case of the phrase ’gevolge en tot voldoeninge van’, we extend with the pre昀椀x ’in’ to ’in gevolge en tot voldoeninge van’ C. The impact of spelling change One of the big hurdles is spelling change. Finding phrases that are orthographically similar to each other is not hard, but phrases o昀琀en contain highly frequent, short function words that require only few character edits to transform one function word into another. This makes it di昀케cult to distinguish cases where two function words are variant spellings of each other from cases where they represent di昀昀erent words and therefore signal that these phrases have di昀昀erent meanings. We experimented with both classic Word2Vec [25] and fastText [18] CBOW embeddings2 to identify variant spellings of words, choosing the latter as it has better performance on a test set of target words and their variants. FastText uses character-level embeddings that are more 2 For both models we use their implementations in Gensim 4.0 [33], see https://radimrehurek.com/gensim/ 147 Figure 3: Extending the phrase ’gevolge en tot voldoeninge van’ using transition probabilities. The prefix probabilities are on the le昀琀, the postfix probabilities on the right. suitable for detecting spelling variations. We use embeddings based on the assumption that variants occur in the same or similar contexts so should end up in the same region in the em- bedding space. This works for spelling variants that are used interchangeably in a single time window. An example in the Resolutions corpus is the word ’en’ (EN: and) and its variant ’ende’. 148 Figure 4: The distribution of yearly frequencies of partial phrases in the 18th century resolutions. There is a period in the 18th century when their uses overlap, as can be seen in the temporal frequency distributions of the variant phrases ’gevolge en tot voldoeninge van’ (EN: follow- ing and in ful昀椀lment of) and ’gevolge ende tot voldoeninge van’ (see the le昀琀 side of Figure 4). The two versions ‘en’ and ‘ende’ share many contexts, so their word embeddings are similar. Using a combination of orthographic similarity (edit distance) and embedding similarity, we can identify word pairs in the corpus that can be linked as variants. However, experiments on 昀椀nding variant spellings of words in the pre- and post-context of phrases have shown that for short function words, spelling change is a major hurdle for both Word2Vec and FastText. But for spelling changes where one variant is used in only one period and the other only in another, non-overlapping period, their contexts can also have di昀昀erent spellings. This results in two sets of contexts that also have no or little overlap. Hence, two spelling variants used in di昀昀erent time periods may end up in di昀昀erent regions in the embedding space. An example is the use ’ae’ in the early 18th century in words like ’aen’ (EN: to and ’haer’ (EN: her), which changed to using ’aa’ from around 1717, when they switched to writing ’aan’ en ’haar’. These words o昀琀en appear together, as in the common phrase ’aen haer hoogh mogende te’ (EN : to her high and mighty at). Because the spelling change for ’aen’ occurs at the same time as the spelling change for its common contextual term ’haer’, the variants ’aen’ and ’aan’ have little contextual overlap, so word embeddings consider them as di昀昀erent words. D. The impact of resolution length One of the characteristic of resolutions that we can study with our list of formulas is the fraction of a resolutions text is made up of formulaic expressions, and which part is not. There are many very short resolutions based on a received missive (starting with the formula ‘Ontfangen een Missive van’) that merely states who wrote the missive, when and where they wrote it, but that do not contain any proposal or request that the SG had to make a decision on. Such resolutions 149 Figure 5: The relationship between the length of a resolution and the fraction of words in a resolution that are part of a formula. end with the formula ‘Waar op geen resolutie is gevallen.’ (EN: On which no decision was made. We therefore expect there to be a relationship between the length of a resolution and the amount of non-formulaic content. Longer resolution provide more detail of the proposition or of the decision or both. We assume that these details are only given when they are deemed relevant and necessary. The details vary across resolutions, therefore lead to less formulaic text. The relationship between the length of resolutions and the fraction of words that are part of formulaic expressions is shown in Figure 5. There is a clear relationship: short resolutions tend to have a larger fraction of formulaic content. As resolutions get longer, a larger fraction of the words they contain are below one of the two frequency thresholds. E. Formulas in Other Corpora To check if this approach generalises to other corpora, we use the same detection process on the corpora of Notarial Deeds, Bern Manate books and t he Dutch Wikipedia. E.1. Mandate books of State of Bern Because of the high Character Error Rate (CER∼ 0.2) and the much smaller size of the corpus (3.8 million words compared to 58 million of the Resolutions), we used word 4-grams instead of 5-grams. The procedure found a handful of candidates, two of which could be extended to 150 formulaic expressions: • The most common phrase is ’Schultheiß und Rath der Statt Bern’, which occur 907 times in 505 di昀昀erent spellings. It refers to the head o昀케cial and the council of the city of Bern. • The second most frequently identi昀椀ed phrase is ’An alle Deütsch und Weltsche’, which oc- curs 559 times in 443 di昀昀erent spellings, and is extended to the formula ‘An alle Deütsch und Weltsche Herren Amtleüth’ (EN: To all German and X gentlemen o昀케cials). This is a formula to signal that the following statute pertains to the o昀케cials of both the French and German speaking parts of the city state Bern, and therefore signals the start of a statute. E.2. Notarial deeds from Amsterdam municipality The notarial deeds corpus also has a high CER of 15-20%. The most frequently found phrase are: • ‘als getuijgen hier overgestaen’ (EN: standing here as witnesses: this is part of a formulaic phrase in the opening paragraph of a notarial deed to indicate who act as witnesses in formalising the transaction. This is useful in identifying the starting paragraphs of deeds that are spread across pages. • ‘H Schaef N P’: the name one of the Amsterdam notaries, who is the o昀케cial responsible for ensuring the transaction is legal. • ‘J de Winter N P’: the name of another Amsterdam notary. These phrases have similar potential in identifying meaningful structural elements in the running text, such as where deeds start and end, and where certain elements of deeds are located within their text. 151