Obtaining More Expressive Corpus Distributions for Standardized Ancient Languages

Oliver Hellwig¹,², Sven Sellmer¹,³ and Sebastian Nehrdich¹,⁴
¹ Institute for Language and Information, Heinrich Heine Universität, Düsseldorf
² Department of Comparative Language Science, University of Zürich
³ Institute for Oriental Studies, Adam Mickiewicz University, Poznań
⁴ Khyentse Center for Tibetan Buddhist Textual Scholarship, Universität Hamburg

Abstract
This paper introduces a latent variable model for ancient languages that aims at quantifying the influence that early authoritative works exert on their literary successors in terms of lexis. The model jointly estimates the amount of word reuse, based on uni- and bigrams of words, and the date of composition of each text. We apply the model to a corpus of pre-Renaissance Latin texts composed between the 3rd c. BCE and the 14th c. CE. Our evaluation focusses on the structures of word reuse detected by the model, on its temporal predictions and on the quality of the inferred diachronic distributions of words; the last aspect is assessed using a newly designed task from the field of computational etymology.

Keywords
Text reuse, citations, standardized languages, historical corpora, Bayesian mixture model

1. Introduction
Constructing diachronic trajectories of word¹ frequencies seems to pose no major technical challenges. Given a database of timestamped texts and their linguistic annotations, one can derive such trajectories by applying smoothing techniques (e.g. temporal binning, kernel-based techniques) to the frequencies of words in individual texts. In the fields of Historical Linguistics and Classical Studies, matters can, however, become more complicated because word frequencies can be influenced by various confounding factors such as the dialect or mother tongue spoken by an author, changes in orthography, or language standardization, on which we focus in this paper.
Following the definition given by Joseph [17], we use the term ‘standardized language’ for a codified, prestigious language variety that is mainly used for administrative and literary purposes. Examples of such languages include Latin as used in the post-Classical period and Sanskrit in the form prescribed by the grammarian Pāṇini. While the vocabularies, as well as stylistic features, of standardized languages may still change (see e.g. Clackson [5] for Latin and Wackernagel [42, XXIIff.] for Sanskrit), phonetics and morpho-syntax remain, so to speak, frozen or undergo only minor diachronic changes. The work described in this paper primarily addresses the question to what degree the word usage in standardized languages reflects the everyday language use of authors who often spoke ‘vulgar’ varieties of the standardized language or, later, vernacular languages stemming from these varieties. Two factors are especially relevant here. First, many authors writing in standardized languages tend to reuse and paraphrase authoritative works which were considered a kind of gold standard (see e.g. Lee [22]; Roberts [34] for Latin). The influence of earlier works can therefore bias and distort the distributions of words found in later ones.
CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands
Oliver.Hellwig@uni-duesseldorf.de (O. Hellwig); sellmer@hhu.de (S. Sellmer); nehrdich@uni-duesseldorf.de (S. Nehrdich)
ORCID: 0000-0003-0387-2827 (O. Hellwig); 0000-0002-6688-0667 (S. Sellmer); 0000-0001-8728-0751 (S. Nehrdich)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
¹ The term ‘word’ denotes the lemma of a word throughout this paper if not specified otherwise.
More generally, such languages are typically conservative in that they preserve words that are no longer current outside literary circles. An instructive example of such a trend is the Latin word equus ‘horse’ [see 9, p. 291]. While this word is the standard expression for ‘horse’ in Classical Latin and does not have any archaic ring to it, the Romance languages, which originated from Latin dialects spoken in late Antiquity (“Vulgar Latin”, see Herman [15]), derive their words for ‘horse’ from Latin caballus (e.g. Fr. cheval, It. cavallo), which suggests that occurrences of equus in post-Classical Latin texts no longer reflect the spoken language. Second, the temporal structure of ancient corpora may be (partly) unclear, making it even more difficult to reliably construct diachronic lexical trajectories. The accumulated effects that standardization, text reuse, semantic conservatism and temporal uncertainties exert on corpus distributions are difficult to determine from the raw corpus data alone, making it necessary to balance the corpus evidence with detailed qualitative – and time-consuming – studies of individual words. Such issues are not restricted to Latin texts of late Antiquity and the Middle Ages [see e.g. 23], but are also found, for example, in Buddhist Chinese [see e.g. 28], in the Indic corpora composed in Sanskrit and Pāli, and in Classical Chinese [38]. As these languages are the ancestors of important modern language families, Classical Studies as well as Linguistics can benefit from corpus distributions that distinguish between actual language use and the influence of authoritative works.² This paper discusses a Bayesian mixture model for lemmatized texts that disentangles the influence exerted by authoritative, frequently cited and paraphrased texts on the word usage encountered in their literary successors.
It aims at generating a clearer picture of the actual practice in standardized languages, at quantifying the amount of word reuse and at unveiling intellectual lineages in such corpora. For modelling word reuse, this paper builds on previous research that quantifies the influence of cited authors in the context of scientific publications [8, 27]. Unlike in such bibliometric studies, citations in ancient corpora are mostly not (clearly) marked as such and must therefore be inferred from the data in the approach presented in this paper. The detection of literary influences can be further enhanced by inspecting lexical n-grams. While many previous approaches represent the textual data as bags of words, one may argue that text reuse and stylistic influences manifest themselves rather in collocations taken over from earlier literary works. While the presence of the unigrams aurum ‘gold’ and pretiosus ‘precious’ only gives a weak indication of literary ancestry, a bigram formed of these two words (in pretiosior auro ‘more precious than gold’) is a much clearer indication that the late Roman author Maximianus was influenced by the Augustan poet Ovid. Our model therefore complements the bag-of-words representation with lexical bigrams [see 46] and makes the decision between uni- and bigrams part of the inference process.
² The expression ‘actual language use’ has to be taken in a technical sense that changes according to the author: for authors speaking some form of Latin, it refers to the language they use in everyday situations; for users of other languages, it denotes, somewhat artificially, the Latin they write, but from which the effects of word reuse have been removed, so to speak.
Another important aspect is the time of composition. Most (Bayesian) mixture models with a temporal component assume that the time of composition is an observed variable (e.g. Blei and Lafferty [3], Wang, Blei, and Heckerman [44]).
Such an assumption does not hold for many ancient texts, as their dates are either unknown or still under scrutiny. While there exist some Latin texts whose dates of composition are strongly disputed (see e.g. Laurioux [21] on the cookbook of Apicius), this problem is more urgent for ancient Indian corpora, where dates proposed for early texts are often just educated guesses (see e.g. Olivelle [30, pp. 7–13] on the Sanskrit philosophical texts called Upaniṣads). We address this issue by modelling the time of composition of each text as a latent variable that conditions the observed features and incorporates the current state of scholarly research with the help of a temporal prior (see Hellwig [14] for a related approach for Vedic Sanskrit). We use Latin texts composed between the 3rd c. BCE and the 14th c. CE as a test case. As Sec. 5 will show, many aspects of the evaluation rely on qualitative arguments, as gold standards for these tasks are currently not available. Using the Latin corpus offers the advantage that the evaluation can build on a long history of literary and linguistic research, so that our results can be compared against an extensive record of previous scholarship. The initial application to the well-researched Latin tradition makes it easier to transfer the methods developed here to more disputed textual traditions of South Asia. After an overview of related work in Computational Linguistics (Sec. 2), Sections 3 and 4 describe the data and the model. Section 5 assesses various choices in the model design using posterior predictive checks (Sec. 5.1) and presents an evaluation of three prominent aspects of our model: word reuse (Sec. 5.2), predicted times (Sec. 5.3) and the inferred corpus distributions (Sec. 5.4), the latter being tested on a new task in computational etymology. Data and scripts are available at https://github.com/OliverHellwig/sanskrit/tree/master/papers/chr2021.

2.
Related research
Our model of word reuse builds on previous work on detecting citation activities in scientific literature. Such activities have repeatedly been formalized using (ad-)mixture models, starting with Cohn and Hofmann [6], whose generative model conditions citations on the presence of hidden topics. Erosheva, Fienberg, and Lafferty [10] extend Latent Dirichlet Allocation by conditioning the generation of links on the same document-specific topic distributions as the generation of words. The citation-influence model of Dietz, Bickel, and Scheffer [8], also assuming citations to be fully observed, splits the process of generating words into two branches: a word in document d is either drawn from the topic distribution of a cited text (which is in turn sampled from a document-specific multinomial distribution over citable documents) or from a word distribution specific to d (“innovation”). Nallapati et al. [27] present two models that treat citations as latent variables sampled on the basis of document-specific topic distributions. Although not directly concerned with citations, the author-topic model of Rosen-Zvi et al. [35] offers an alternative view of what we want to achieve in this paper, as some texts in ancient standardized languages can indeed be considered the work of a collective of – not necessarily contemporaneous – authors (see e.g. Colledge [7] on the composition of the Legenda aurea by an anonymous group of authors).

Table 1
Composition of the corpus. The first column gives the historical period according to Adamik [1] (also see Sec. 5.3).

Period         Authors     Tokens
Old                  3      9,653
Classical           59  2,630,289
Late                46  1,765,834
Transitional        13     90,656
Medieval            45    823,974

Previous research has proposed various admixture models that contain a temporal component modelled either in discrete bins (e.g. Blei and Lafferty [3] or Frermann and Lapata [11] with Gaussian priors on logistic topic-word mixtures) or as continuous observed variables (e.g.
Wang and McCallum [45]; Wang, Blei, and Heckerman [44]). More complex models as e.g. proposed by Kawamae [18] split the generation of words in time- and document-specific branches. Using bigrams in admixture models was first proposed by Wallach [43] (also see Nokel and Loukachevitch [29] Nokel and Loukachevitch [29] for a survey). While Wallach models all data points as bigrams, the collocation model of Griffiths, Steyvers, and Tenenbaum [13] makes the decision for uni- vs. bigrams part of the model structure. Wang, McCallum, and Wei [46] further make the decision for uni- vs. bigrams dependent from the hidden topic. 3. Data The experiments described in this paper are based on the works of 166 Latin authors who were active between the 3rd c. BCE and the 14th c. CE, the French philosopher Nicole Oresme (1320-1382) being the latest one included. From among the available Latin corpora (for an overview see McGillivray [24, ch. 2]), we chose the Latin library corpus of the CLTK library3 due to its wide coverage. An author is included if at least 50k of text are contained in the CLTK library or if the author is considered important for (text-)historical reasons (e.g. the Res gestae of Augustus). The raw source data are unbalanced (authors such as Cicero or Thomas Aquinas are strongly over-represented), and individual works are often split into multiple files. We therefore merge all works of one author into a single text, although, arguably, the preference for citing and reusing text can vary inside the oeuvre of an author. Latin is a strongly inflectional language. In addition, the orthography of some source texts has not been standardized, and especially the late Christian authors are responsible for some variation so that working with raw textual data would result in very sparse feature matrices. 
All texts are therefore lemmatized using Collatinus [31] (which manages to resolve many of the non-standard spellings in the process), and these lemmatized versions constitute the data used for all following steps of the processing pipeline. After removing 104 stop words such as ad ‘to(wards)’, et ‘and’ or meus ‘my’ as well as lemmata that occur less than 30 times, the corpus consists of 5,320,406 word tokens with 10,309 distinct lemmata (also see the summary in Tab. 1). Public sources such as the Encyclopedia Britannica and Wikipedia are used for gathering information about the lifetime of each author (ld, ud: birth and death years of author d). If not specified otherwise, the date md of a text d denotes the mean of this time span, i.e. md = (ld + ud)/2.
³ thelatinlibrary.com, http://cltk.org/

4. Model
The model discussed in this paper needs to deal with three types of uncertainty: (1) unknown structures of word reuse; (2) fuzzy or unknown dates of composition; (3) the question whether uni- or bigrams of words should be used as the observed features. This leads to the following generative story (see Eq. 2 for the complete specification): First, for the i-th word in text d, the source text cdi is drawn from a text-specific multinomial distribution ξd. Note that ξd includes the text d itself. Such self-loops mean that the respective data point is peculiar to the actual author of text d.⁴ While many citation models proposed so far can build on a given citation structure (as e.g. defined by web links or scholarly citations in articles), this information is not available for our data. The value of the prior αij (text i cites from text j) therefore needs to be adapted during inference depending on the inferred latent times.
After each iteration of the Gibbs sampler (i.e. after one pass over all data points), the mean time slots µ of all texts are calculated based on the current state of the latent temporal assignments, and the value of αij is updated using a sigmoid function:

    αij = 10                               if i = j
          0                                if µj − µi > 3
          1 / (1 + exp(−(µi − µj)))        otherwise        (1)

The high value for αii encourages the model to explain the words observed in a text by the preferences of its author. Note that the zeros for the case µj − µi > 3 are structural zeros, so that text j is not considered a possible source of i if αij = 0.⁵ In addition, we multiply each element of α with a citation mask m ∈ {0, 1}^(D×D) that is derived from running a Levenshtein-based citation detector over the unlemmatized texts. The value mij is set to 1 if at least one sequence of five or more words is shared by texts i and j, and to zero otherwise. Zero values in m are again interpreted as structural zeros. The use of this mask is based on the idea that literal citations, as detected by the Levenshtein algorithm, indicate the acquaintance of an author with a previous work and thus increase the probability that individual words from this previous work are reused as well. Second, a time slot tdi is drawn from a text-specific multinomial temporal distribution ω_cdi. The prior β_cdi of ω_cdi incorporates the current state of scholarly knowledge about the time of composition of text cdi: possible time slots obtain a flat uniform prior in the range [l_cdi, u_cdi], while slots outside [l_cdi, u_cdi] are set to structural zeros. Third, the model draws a Bernoulli-distributed variable bdi that decides whether the word xdi and its successor xd,i+1 typically form a bigram. Contrary to the model proposed by Wang, McCallum, and Wei [46], this decision does not depend on the sampled time tdi and thus saves (T − 1) · V² trainable parameters.
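As an illustration, the prior update of Eq. 1 and a simplified version of the citation mask can be sketched as follows; function names are ours, and the exact-n-gram matching is only a stand-in for the Levenshtein-based detector used in the paper:

```python
import numpy as np

def update_alpha(mu, gap=3.0, self_weight=10.0):
    """Recompute the citation prior alpha from the current mean time slots mu
    (one value per text), following Eq. 1: a high self-loop weight, a structural
    zero when the candidate source is more than `gap` slots younger than the
    citing text, and a sigmoid of the slot difference otherwise."""
    mu = np.asarray(mu, dtype=float)
    diff = mu[:, None] - mu[None, :]         # diff[i, j] = mu_i - mu_j
    alpha = 1.0 / (1.0 + np.exp(-diff))      # sigmoid(mu_i - mu_j)
    alpha[-diff > gap] = 0.0                 # mu_j - mu_i > gap -> structural zero
    np.fill_diagonal(alpha, self_weight)     # strong self-loop prior
    return alpha

def citation_mask(texts, n=5):
    """Binary mask m[i, j] = 1 iff texts i and j (lists of word tokens) share
    at least one sequence of n tokens. Exact matching is a simplification of
    the paper's Levenshtein-based detector."""
    D = len(texts)
    grams = [{tuple(t[k:k + n]) for k in range(len(t) - n + 1)} for t in texts]
    m = np.zeros((D, D), dtype=int)
    for i in range(D):
        for j in range(D):
            if i != j and grams[i] & grams[j]:
                m[i, j] = 1
    # keep self-loops so the mask does not cancel the self-prior
    # (an assumption on our part; the paper does not spell this out)
    np.fill_diagonal(m, 1)
    return m
```

The structural zeros survive multiplication with the mask, so both constraints can be combined by a simple elementwise product `alpha * m`.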
Based on the sampled value of bdi, either the unigram xdi or the bigram xdi xd,i+1 is drawn from the time-specific multinomial distributions ϕU_tdi resp. ϕB_tdi,xdi. With Θ denoting all trainable parameters and π all priors, the joint distribution is given by the following equation (notation in Tab. 2):

Table 2
Variables, dim(ensions), par(ameters), c(ounters) and pr(iors) of a model with D documents, T time slots and a vocabulary size of V

Variable             Dim.        Par.  C.   Pr.
text → citation      R^(D×D)     ξ     A    α
citation → time      R^(D×T)     ω     B    β
time → unigrams      R^(T×V)     ϕU    CU   γU
time → bigrams       R^(T×V×V)   ϕB    CB   γB
words → uni-/bigr.   R^(V×V)     ψ     L    δ

p(c, b, t, x, Θ | π) = ∏_{d=1}^{D} Dir(ξ_d | α_d) · ∏_{d=1}^{D} Dir(ω_d | β_d) · ∏_{u=1}^{T} Dir(ϕU_u | γU)
    · ∏_{u=1}^{T} ∏_{v=1}^{V} Dir(ϕB_uv | γB) · ∏_{v=1}^{V} ∏_{w=1}^{V} Beta(ψ_vw | δ)
    · ∏_{d=1}^{D} ∏_{i=1}^{n_d} [ Cat(c_di | ξ_d) · Cat(t_di | ω_cdi) · Bern(b_di | ψ_xdi,xd,i+1)
        · ( b_di · Cat(x_d,i+1 | ϕB_tdi,xdi) + (1 − b_di) · Cat(x_di | ϕU_tdi) ) ]        (2)

The blocked Rao-Blackwellized Gibbs sampler [12] is obtained by using Dirichlet-multinomial integrals:

p(c_di = e, t_di = k, b_di = l, x_di = u, x_d,i+1 = v | c_−di, t_−di, x_−di, Θ, π)
    ∝ (A_de^(−di) + α_de) · (B_ek^(−di) + β_ek) / ∑_{l'=1}^{T} (B_el'^(−di) + β_el')
      · { (L¹_uv^(−di) + δ) · (CB_kuv^(−di) + γB_v) / ∑_{w=1}^{V} (CB_kuw^(−di) + γB_w)    if b_di = 1
          (L⁰_uv^(−di) + δ) · (CU_ku^(−di) + γU_u) / ∑_{w=1}^{V} (CU_kw^(−di) + γU_w)      if b_di = 0 }

where the superscript (−di) indicates that the counts are computed without the data point currently being resampled.
⁴ This choice is represented by the Beta-distributed variable λ in Dietz, Bickel, and Scheffer [8].
⁵ The difference of three time slots is motivated by the following idea: as will be shown in Sec. 5.1, 150 is a good choice for the number of time slots. As the whole corpus covers a temporal range of about 1,700 years, three time slots correspond to slightly more than 30 years, a span that may describe the active period of one author.
A small, but important difference to models that operate with a known citation structure is the selection of possible sources.
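As a concrete sketch of the collapsed conditional above, the following function computes the unnormalized probabilities p(c_di = e, t_di = k, b_di = l) for a single token. All names are ours, symmetric scalar priors are assumed for simplicity, and the count arrays are assumed to already exclude the token being resampled:

```python
import numpy as np

def token_conditional(u, v, A_d, alpha_d, B, beta, CU, CB, L, delta, gammaU, gammaB):
    """Unnormalized p(c=e, t=k, b=l) for one token with unigram id u and
    successor id v. Counter names follow Tab. 2:
    A_d/alpha_d: citation counts/prior for the current text (length D);
    B/beta: citation-to-time counts/prior (D x T);
    CU: T x V unigram counts; CB: T x V x V bigram counts;
    L: V x V x 2 Bernoulli counts for the uni-/bigram decision."""
    cite = A_d + alpha_d                                          # (D,)
    time = (B + beta) / (B + beta).sum(axis=1, keepdims=True)     # (D, T)
    uni = (L[u, v, 0] + delta) * (CU[:, u] + gammaU) \
          / (CU + gammaU).sum(axis=1)                             # (T,)
    bi = (L[u, v, 1] + delta) * (CB[:, u, v] + gammaB) \
         / (CB[:, u, :] + gammaB).sum(axis=1)                     # (T,)
    # p[e, k, l]: source text e, time slot k, unigram (l=0) vs. bigram (l=1)
    p = cite[:, None, None] * time[:, :, None] \
        * np.stack([uni, bi], axis=-1)[None, :, :]
    return p
```

Normalizing `p` and drawing one index from the flattened array yields the blocked sample (e, k, l) for the token; structural zeros in `alpha_d` and `beta` automatically remove impossible source texts and time slots.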
In this paper, a text c is only considered as a possible source for an observed unigram xdi or bigram xdi xd,i+1 if it also contains xdi resp. xdi xd,i+1. This condition prevents the model from assigning too much weight to early authors such as Cicero and Vergil.

5. Experiments
This section reports qualitative and quantitative evaluations for the three relevant elements of our model: the detected structure of word reuse (Sec. 5.2), the temporal predictions (Sec. 5.3) and the diachronic trajectories of words that can be inferred from it (Sec. 5.4).

5.1. Architecture and Parameter Settings
We use posterior predictive checks (PPC; Mimno, Blei, and Engelhardt [26]) to compare various model architectures and parameter settings. Given a trained model, we draw textwise samples of the observed words using Eq. 2 and compare these samples with the true distributions in each text using the Hellinger distance. The values that result from 30 replications per text are grouped by texts and z-standardized, and ANOVAs are performed in order to test for significant differences between settings.

Figure 1: Results of the posterior predictive checks and Cohen’s η². Each colored curve shows the density of the z-standardized values for one setting. Small z-scores are better. (a) No. of slots, η² = 0.5912; (b) prior γ, η² = 0.4899; (c) prior δ, η² = 0.01540; (d) citation mask, η² = 0.0004.

Figure 1 shows smoothed density estimates of these z-scores for four central design choices: the number of temporal slots (Fig. 1a), the parameters γ (time → feature; Fig. 1b) and δ (uni- or bigram; Fig. 1c) and the use of the precomputed citation mask (Fig. 1d).
While ANOVA points to (highly) significant differences in all four settings, the values of Cohen’s η², which quantify the effect size and are displayed below each subfigure, indicate that only the number of slots and the prior γ have a relevant influence on the outcome of the model, while the influence of δ and of the citation mask must be considered very small. Based on this evaluation, we choose 150 time slots, γ = 0.5 and δ = 0.01 for all following experiments, and we apply the citation mask. Running another PPC to establish the optimal number of iterations of the Gibbs sampler, we found no significant differences between models trained with 100, 300, 500 or 1,000 iterations (p-value of the ANOVA: 0.163). This somewhat unexpected result is certainly due to the fact that our model already has rather strong priors, induced by the structural zeros in the citation mask and the temporal prior β, so that only a few iterations are required to obtain a good representation of the data. We therefore run the sampler for 100 iterations and record the sampled values once after the last iteration.

5.2. Word Reuse
As mentioned in the introduction, understanding the intellectual lineages of historical corpora is one important aim of this paper. Therefore, the evaluation starts with inspecting the inferred structure of word reuse. We calculate, for each text d, the proportion of words labeled as reused, i.e. those for which cdi ≠ d according to the model output. These proportions can be expected to be correlated with the true date of d, as later texts have more opportunities to reuse words than earlier ones. In order to account for this effect, we perform a partial correlation analysis by fitting a linear regression that predicts the proportion of words labeled as reused (y) from the number of possible source texts (x). The residuals of this regression, which capture how much the model output deviates from the linear estimate, are plotted against the true date of each text (see Fig. 2a).
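The partial-correlation step just described can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def reuse_residuals(n_sources, reuse_prop):
    """Fit a least-squares line predicting the proportion of reused words
    from the number of possible source texts, and return the residuals.
    Positive residuals indicate more reuse than the linear trend predicts."""
    x = np.asarray(n_sources, dtype=float)
    y = np.asarray(reuse_prop, dtype=float)
    X = np.column_stack([np.ones_like(x), x])     # intercept + slope
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef
```

Plotting these residuals against the true dates reproduces the kind of analysis shown in Fig. 2a: texts lying well above zero reuse more than their position in the corpus would suggest.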
Here, the dashed horizontal line at y = 0 corresponds to a residual of 0 and thus to a perfect prediction of the model output by the linear regression. The blue curved line is a smoothed density estimate of the actual residuals. This smoothed estimate shows that the proportions of word reuse conform to the values estimated by the linear model until the end of Late Antiquity (5th c. CE).

Figure 2: Patterns of word reuse detected by the model. (a) Residuals of a linear regression that predicts the inferred number of reused words given the number of available source authors. Individual authors are labeled if their residuals fall in the 5% resp. 95% quantiles. (b) Schematic representation of word reuse, grouped by literary periods. The source periods are found at the bottom. The line width indicates the strength of the activity.

We observe increasing word reuse in the 8th or 9th c. CE, a period commonly known as the Carolingian Renaissance, which saw a revival of classical Latin literature that accompanied the formation of the Carolingian state [see e.g. 39]. In the 10th c. CE and later, the proportions of word reuse tend to fall below their expected values. Seen from the perspective of literary history, the intensive word reuse in the Carolingian Renaissance is connected with authors such as Hrabanus Maurus, Angilbert or Alcuin, who in his De rhetorica freely mixes extracts from Cicero’s De inventione and other authoritative sources with his own comments [see e.g. 19]. In the 10th c., a new form of Latin takes shape which, though still accepting the classical language as its gold standard, is strongly influenced by the idiom of Christian theological authors (“Ecclesiastical Latin”, see e.g. Dinkova-Bruun [9]). This form of medieval Latin can therefore be expected to share fewer lexical features with Classical Latin than earlier forms of the language. Figure 2b presents another view of the literary influences.
In this plot all texts are aggregated by the five literary periods defined by Adamik [1], plus an extra period “Medieval Latin” starting at 900 CE.⁶ The widths of the lines between target, i.e. “citing” (top), and source, i.e. “cited” (bottom), periods indicate the relative amount of word reuse inferred by the model. The plot shows that works from the classical era consistently remained important sources of word reuse throughout all periods considered in this paper, although even their influence begins to wane in the Transitional Period (600–900 CE) and the Middle Ages. Such a result makes sense, as the works of some classical authors did not survive the upheavals of political and religious history and were only rediscovered in the Italian Renaissance or even later (see e.g. Tutrone [40] on the limited reception of the important Roman philosopher Lucretius in the (early) Middle Ages). The strong connections between Late Latin on the one hand and the transitional and medieval periods on the other are due to the numerous important Christian texts composed in Late Antiquity, most notably the Latin translation of the Bible (Vulgata) and the work of Augustine. In addition, Fig. 2b shows a decline in word reuse between the Transitional Period, which comprises the Carolingian Renaissance just discussed, and Medieval Latin – most authors from the Transitional Period evidently received little attention in later times. In order to understand which authors are mainly responsible for the distribution observed in Fig. 2b, we collect, for each literary period, the three authors with the highest number of words marked as reused, applying a minimum threshold of 1,000.
⁶ We label the period called Vulgar Latin by Adamik as Late Latin in this paper in order to distinguish it from the sub-standard variety discussed by Herman [15].
The resulting list contains the following authors:

Old: Cato (the Elder) is the only representative of old Roman literature, a result which is in accordance with his extraordinary importance for the development of a genuinely Latin literature. His compendia on agriculture and warfare as well as the collection of his orations (compiled by himself) exerted a considerable influence on later authors [2, pp. 340–41].

Classical: Ovid and Cicero can be seen as the top representatives of Latin poetry and prose, while Livy stands for the genre of classical historiography.

Late: This period shows an interesting interference between the famous Christian author Augustine and the Vulgata, a new translation of the Bible composed by Jerome. Contrary to what might be expected, Augustine is more frequently marked as cited than the Vulgata (161,501 vs. 74,445 times). A closer inspection of words and bigrams labeled as cited reveals that the model has difficulties assigning individual Biblical citations to the Vulgata or the Vetus Latina, the older Latin version of the Bible preferentially cited by Augustine [see e.g. 16, pp. 36–39]. The third representative of this period is Gregory of Tours, best known for his historical writings.

Transitional: Here, only Beda has made it into the list – a result fully in accordance with his popularity in the Middle Ages [47].

Medieval: While Thomas Aquinas is a central representative of medieval Latin and its focus on theological discussions, Albert of Aix and William of Tyre represent the genre of medieval historical writings with a special focus on the Crusades.

To sum up this section, it appears that the model was able to recover structures of word reuse that conform to scholarly expectations.

5.3. Timestamping
We model the partly unclear times of composition as latent variables. In this section we assess the quality of the resulting temporal predictions.
We simulate a research setting in which only approximate temporal information is available by setting the temporal ranges of all D texts d to the ranges of the literary periods containing them according to Adamik [1, p. 9]. These artificially obfuscated ranges are used as temporal priors βd (see Eq. 2). All texts are trained jointly, and we evaluate how well the model can recover the exact dates and the correct temporal order of the texts. Notably, this experiment is not merely another academic exercise, but bears practical implications for the study of ancient Indian text corpora, for which only approximate temporal information is available [see 14]. Table 3 reports two evaluation measures:

• The period-wise mean absolute error (MAE), calculated as 1/|P| · ∑_{d∈P} |md − µd|, where µd is the mean of the word-wise temporal assignments for text d, and P is the set of texts belonging to the literary period.

• Ranking accuracy: the texts are grouped by their literary periods, and all texts belonging to one period are ordered by their true dates md. The ranking accuracy gives the proportion of text pairs for which the predicted temporal order is the same as the true one.

Table 3
Grouped mean absolute errors (MAE; in years) and ranking accuracies of the temporal predictions for five literary periods

Period         MAE    Rank acc.
Old            41.1    0.0
Classical      87.7   52.5
Vulgar        101.3   48.6
Transitional   67.4   59.0
Medieval      136.5   40.3

The results in Tab. 3 show that dating texts composed in standardized languages is challenging. Although the literary periods only extend over 200–300 years each, the MAEs vary between 40 and 140 years and thus cover substantial parts of each period. It may, however, be noted that Kumar, Lease, and Baldridge [20] report slightly higher MAEs of 85–155 years for English stories published between 1798 and 2008, which suggests that the results achieved by our model are actually in an acceptable range.
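Both evaluation measures are straightforward to compute from the true and predicted dates; a sketch (names are ours, and tied predictions count as errors in this simplified version):

```python
import numpy as np
from itertools import combinations

def mae(true_dates, pred_dates):
    """Mean absolute error of the predicted dates, in years."""
    t = np.asarray(true_dates, dtype=float)
    p = np.asarray(pred_dates, dtype=float)
    return float(np.mean(np.abs(t - p)))

def ranking_accuracy(true_dates, pred_dates):
    """Proportion of text pairs whose predicted temporal order agrees
    with the true order (the product of the two differences is positive)."""
    pairs = list(combinations(range(len(true_dates)), 2))
    ok = sum(1 for i, j in pairs
             if (pred_dates[i] - pred_dates[j])
                * (true_dates[i] - true_dates[j]) > 0)
    return ok / len(pairs)
```

For period-wise scores, as in Tab. 3, the two functions are simply applied to the subset of texts belonging to each period.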
The values of the ranking accuracy are coupled with the uncertainties in the temporal predictions and fall below the random baseline of 50% for three of the five periods. Notably, both evaluation measures seem to get worse for post-Classical texts, when Latin gradually ceased to be used as a spoken language, and an ANOVA of the MAEs as well as a Fisher-Yates test of the raw counts for the ranking accuracies both show (highly) significant differences between all periods (p-values: 0.00147 [MAE]; 0.0005 [ranking acc.]). In order to assess whether the temporal predictions improve when more reliable temporal information is available, we perform a cross-validation experiment. A subset of fifteen authors⁷ is chosen as the test set. For each text in this set, we obfuscate its date in the same way as in the first experiment, while all other D − 1 texts keep their gold temporal information. The model is trained with the D − 1 training texts for 100 iterations and then for another 100 iterations with the combined training and test set (see the method Gibbs1 in Yao, Mimno, and McCallum [49]). The results are compared with the predictions made by the Topic over Time (ToT) model [45], which is often used as a baseline for latent variable models with a temporal component.⁸ The results in Tab. 4 show that our model is slightly, but not significantly, better than ToT (p-value of a paired directed Wilcoxon test: 0.26). While ToT occasionally assigns all texts from one period to the same date range, our model better captures the temporal dynamics. This impression is confirmed when calculating the ranking accuracy (ours: 60%; ToT: 33%) for the data in Tab. 4.
⁷ This limitation is due to time constraints. We choose three authors from the start, middle and end of each period; see the first column of Tab. 4.
⁸ We use 150 topics and all hyperparameter settings as described in the original paper.
The predicted time slot is the one with the highest posterior, argmax_t Σ_{i=1}^{n_d} log p(t|ψ_{z_i}), with the additional constraint that l_d ≤ t ≤ u_d, in order to make a fair comparison with the model presented in this paper; see Sec. 2 in Wang and McCallum [45].

Table 4
Cross-validated temporal predictions of the model in this paper and ToT [45]. The best prediction per text is printed bold. Predictions that fall in the true temporal range of a text are underlined.

Text              Date         This paper   ToT
Naevius           -270/-201    -179         -298
Ennius            -239/-169    -229         -298
Cato              -234/-149    -260         -298
Cicero            -106/-43       78           -6
Seneca Y.           -4/65       120           -4
Apuleius           123/170      121            1
Commodianus        225/275      407          374
Leo the Great      390/461      431          406
Maximianus         500/600      413          374
Chron. Fredegar    600/700      824          813
Alcuin             735/804      868          735
Erchempert         850/900      678          753
Leo of N.         900/1000     1095         1312
Bernard de C.    1100/1200     1138         1300
Nicole O.        1320/1382     1297         1291
MAE                             93.2        114.8

5.4. Features

Getting a more realistic picture of how words are diachronically distributed in standardized languages is an important aim of this paper. This section therefore compares the linguistic expressiveness of the empirical corpus distributions with that of the distributions inferred by our model. Using posterior estimates of the variational parameters (e.g. ω′_dt = (B_dt + τ_dt) / Σ_{u=1}^{T} (B_du + τ_du)) based on those cases in which the model assigns words to unigrams, we obtain the conditional probabilities p(x|d) of a word x given a text d by marginalizing over the latent citations and temporal assignments:

p(x|d) = Σ_{c=1}^{D} Σ_{t=1}^{T} p(c|d) p(t|c) p(x|t) = Σ_{c=1}^{D} Σ_{t=1}^{T} ξ′_dc ω′_ct ϕ^U_tx    (3)

We expect the diachronic trajectories of this conditional distribution to differ from the corpus distribution of a word x when the use of x in later texts is mainly due to literary influences. In order to quantitatively support the claim that the inferred distributions yield a more realistic description of actual language use, we address the problem of predicting lexical stability [see 37].
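The marginalization in eq. (3) amounts to a chain of stochastic matrix products over the latent citations c and time slots t. A minimal numpy sketch with toy dimensions and random row-stochastic parameters (not the fitted model; the variables xi, omega and phi stand for ξ′, ω′ and ϕ^U):

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, V = 4, 3, 6   # number of texts, time slots, vocabulary size (toy values)

def row_stochastic(shape):
    """Random matrix whose rows are probability distributions."""
    m = rng.random(shape)
    return m / m.sum(axis=1, keepdims=True)

xi    = row_stochastic((D, D))  # xi'_dc    ~ p(c|d): citation weights
omega = row_stochastic((D, T))  # omega'_ct ~ p(t|c): temporal assignments
phi   = row_stochastic((T, V))  # phi^U_tx  ~ p(x|t): unigram emissions

# Eq. (3): p(x|d) = sum_c sum_t xi'_dc * omega'_ct * phi^U_tx
p_x_given_d = np.einsum('dc,ct,tx->dx', xi, omega, phi)

# A chain of row-stochastic matrices is again row-stochastic
print(np.allclose(p_x_given_d.sum(axis=1), 1.0))  # -> True
```

The einsum is equivalent to the matrix product `xi @ omega @ phi`; each row of `p_x_given_d` is one inferred word distribution of the kind whose diachronic trajectories are compared with the empirical corpus counts below.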
While substantial parts of the vocabulary of the Romance languages can be derived from precursors in (Vulgar) Latin by applying rules of regular sound change [36, 37], there are important individual words such as equus ‘horse’ as well as whole classes of words such as the vocabulary of war that do not have derivatives in the Romance languages. Apart from various socio-cultural factors (on which see e.g. Campbell [4], 244ff. and especially Vincent [41] on Latin vocabulary that was “submerged” in classical works), the frequency of use in the spoken language is a determining factor for the survival or obsolescence of a word [32]. If the inferred distributions capture the actual use better than the corpus distributions, they should also be better able to predict the survival of Latin words in the Romance languages.9

As the etymological information in Wiktionary is incomplete and noisy [48], we collect all Latin words that are recorded as etyma of Romance words in Meyer-Lübke [25], a standard reference work of Romance etymologies. Although scholarly research has revised some decisions made in this work, it is still considered a largely complete collection of surviving Latin etyma (see e.g. Stefenelli [37], 568), so that words not recorded there can be assumed not to have derivatives in the Romance languages. Of the 10,308 words in our vocabulary, 2,691 (i.e. 26.1%) have such a derivative.10 We aggregate the empirical and inferred distributions by various ranges of years, z-standardize these binned values and use them as input features for a feed-forward neural network with four hidden units and softplus activations. The neural network is trained on the binary task of predicting whether or not a Latin word has derivatives in any Romance language.11 Figure 3a shows the F-scores (y-axis) depending on the sizes of the temporal bins applied (x-axis).

9 A factor we ignored as non-essential for the present purpose, but which should be taken into account in future, more detailed studies, is the fact that, starting from the early Middle Ages, the mother tongue of a growing number of authors is not a Romance language but belongs to another family (mostly the Germanic one).
10 Note that this number only covers Romance words derived by regular sound change, but not, for example, borrowed words.
11 Meyer-Lübke [25] does not consistently report all Romance derivatives of a given Latin word, so that we could not formulate this problem as a multi-class prediction task. – Apart from a simple neural network, we also tested flat ML models such as logistic regression, but found our approach to perform better.

While the F-scores generally decrease with increasing sizes of the temporal bins, the F-score of the inferred distributions is consistently higher than that of the empirical ones. The drop of the F-score is especially pronounced when the empirical distributions are binned at 30 years instead of being left unbinned (“all”); the failure of the model that uses the empirical distributions is due to its low recall in these cases. In order to better understand the behaviour of the predictor, we collect all inherited words that were labelled correctly using the inferred, but wrongly using the empirical distributions, calculate their empirical and inferred distributions, and smooth these distributions with a Gaussian kernel. Figure 3c contrasts the means (plus/minus one standard deviation) of the two groups. The plot shows that the inferred distribution transfers probability mass from occurrences in (late) classical texts to the (early) Middle Ages (from ca. the 8th c. onwards), i.e. to the period in which the Romance languages are generally assumed to have developed. A similar effect can be observed for words which are only predicted correctly when using the empirical distributions (see Fig. 3d). Apparently, the mixture model has missed effects of word reuse in these cases, as it assigns too much weight to occurrences in the early Middle Ages. Finally, when examining the distributions of inherited words detected by neither classifier, it becomes apparent that many of them are popular in classical and medieval texts, but rare in Late Antiquity and the Transitional Period (see e.g. the plots for expecto ‘expect’ in Fig. 3b). Although the mixture model raises the distributions for the critical phase of the early Middle Ages, this effect is not strong enough to make the classifier label such words as inherited.

[Figure 3: Results of the etymology prediction task. (a) F-scores of predicting the lexical stability of Latin words in Romance languages, reported without grouping (“all”) and grouped by the number of years per temporal bin; (b) normalized and smoothed empirical and inferred distributions of the word expecto ‘expect’, whose etymological development was mispredicted by both distributions; (c) smoothed accumulated distributions for cases in which only the inferred distributions produce the correct result; (d) the same for cases in which only the empirical distributions are correct.]

6. Summary

Diachronic corpora are indispensable tools for studying linguistic developments and intellectual lineages in premodern societies. Depending on the degree of standardization which the corpus language has undergone, as well as on the amount of text reuse, linguistic distributions extracted from diachronic corpora can be misleading, because the language usage of authoritative, frequently cited works can conflate with that of their literary successors. This paper introduces a latent variable model that captures such literary influences while simultaneously accounting for uncertainties in the temporal assignments.
While the latter aspect is only of limited importance for Latin, the corpus language discussed in this paper, it is certainly relevant for many ancient corpora whose temporal structure is more disputed. Our discussion has shown that the model retrieves meaningful intellectual lineages and structures of word reuse (see Sec. 5.2) and performs on par with latent variable models specifically designed for capturing temporal topical trends (Sec. 5.3). In addition, the discussion of etymological derivations in Sec. 5.4 indicates that the linguistic distributions generated by the model describe certain aspects of language development better than plain corpus distributions. Future extensions should incorporate a component that smooths the temporal distributions [see e.g. 11], and they should consider non-temporal influence factors such as the geographic origin or genre of a text, as proposed by Perrone et al. [33]. Given this outcome, we are planning to apply the mixture model to text traditions of ancient South Asia whose intellectual and diachronic structures are still not fully understood.

Acknowledgments

We thank Sabine Tittel for her help with digital resources for Romance languages and the three anonymous reviewers for their insightful comments. The authors were partly funded by the German Federal Ministry of Education and Research, FKZ 01UG2121.

References

[1] B. Adamik. “The Periodization of Latin. An Old Question Revisited”. In: Latin Linguistics in the Early 21st Century. Ed. by G. V. Haverling. Uppsala Universitet, 2015, pp. 640–652.
[2] M. von Albrecht. Geschichte der römischen Literatur. Vol. 1. Berlin: de Gruyter, 2012.
[3] D. M. Blei and J. D. Lafferty. “Dynamic Topic Models”. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, pp. 113–120.
[4] T. Campbell. Historical Linguistics. Edinburgh: Edinburgh University Press, 2013.
[5] J. Clackson. “Classical Latin”. In: A Companion to the Latin Language. Ed.
by J. Clackson. Malden, MA: Blackwell Publishing, 2011, pp. 236–256.
[6] D. Cohn and T. Hofmann. “The Missing Link – A Probabilistic Model of Document Content and Hypertext Connectivity”. In: Advances in Neural Information Processing Systems. 2001, pp. 430–436.
[7] E. Colledge. “James of Voragine’s “Legenda Sancti Augustini” and its Sources”. In: Augustiniana 35.3/4 (1985), pp. 281–314.
[8] L. Dietz, S. Bickel, and T. Scheffer. “Unsupervised Prediction of Citation Influences”. In: Proceedings of the 24th ICML. 2007, pp. 233–240.
[9] G. Dinkova-Bruun. “Medieval Latin”. In: A Companion to the Latin Language. Ed. by J. Clackson. Malden, MA: Blackwell Publishing, 2011, pp. 284–302.
[10] E. Erosheva, S. Fienberg, and J. Lafferty. “Mixed-membership Models of Scientific Publications”. In: Proceedings of the National Academy of Sciences 101.Suppl. 1 (2004), pp. 5220–5227.
[11] L. Frermann and M. Lapata. “A Bayesian Model of Diachronic Meaning Change”. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 31–45.
[12] T. L. Griffiths and M. Steyvers. “Finding Scientific Topics”. In: Proceedings of the National Academy of Sciences 101.Suppl. 1 (2004), pp. 5228–5235.
[13] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. “Topics in Semantic Representation”. In: Psychological Review 114.2 (2007), pp. 211–244.
[14] O. Hellwig. “Dating and Stratifying a Historical Corpus with a Bayesian Mixture Model”. In: Proceedings of LT4HALA. 2020, pp. 1–9.
[15] J. Herman. Vulgar Latin. University Park, Pennsylvania: Pennsylvania State University Press, 2000.
[16] H. Houghton. The Latin New Testament. Oxford: Oxford University Press, 2016.
[17] J. E. Joseph. Eloquence and Power: The Rise of Language Standards and Standard Languages. London: Frances Pinter, 1987.
[18] N. Kawamae. “Trend Analysis Model: Trend Consists of Temporal Words, Topics, and Timestamps”. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining.
2011, pp. 317–326.
[19] M. S. Kempshall. “The virtues of rhetoric: Alcuin’s “Disputatio de rhetorica et de uirtutibus””. In: Anglo-Saxon England 37 (2008), pp. 7–30.
[20] A. Kumar, M. Lease, and J. Baldridge. “Supervised Language Modeling for Temporal Resolution of Texts”. In: Proceedings of the 20th ACM CIKM. 2011, pp. 2069–2072.
[21] B. Laurioux. “Cuisiner à l’antique: Apicius au Moyen Âge”. In: Médiévales 26 (1994), pp. 17–38.
[22] J. Lee. “A Computational Model of Text Reuse in Ancient Literary Texts”. In: Proceedings of the 45th ACL. 2007, pp. 472–479.
[23] E. Manjavacas, F. Karsdorp, and M. Kestemont. “A Statistical Foray into Contextual Aspects of Intertextuality”. In: Proceedings of the Workshop on Computational Humanities Research (CHR 2020). Ed. by F. Karsdorp, B. McGillivray, A. Nerghes, and M. Wevers. 2020, pp. 77–96.
[24] B. McGillivray. Methods in Latin Computational Linguistics. Vol. 1. Brill’s Studies in Historical Linguistics. Leiden: Brill, 2014.
[25] W. Meyer-Lübke. Romanisches etymologisches Wörterbuch. Heidelberg: Winter, 1935.
[26] D. Mimno, D. M. Blei, and B. E. Engelhardt. “Posterior Predictive Checks to Quantify Lack-of-fit in Admixture Models of Latent Population Structure”. In: Proceedings of the National Academy of Sciences 112.26 (2015), E3441–E3450.
[27] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. “Joint Latent Topic Models for Text and Citations”. In: Proceedings of the 14th ACM SIGKDD. 2008, pp. 542–550.
[28] S. Nehrdich. “A Method for the Calculation of Parallel Passages for Buddhist Chinese Sources Based on Million-scale Nearest Neighbor Search”. In: Journal of the Japanese Association for Digital Humanities 5.2 (2020), pp. 132–153.
[29] M. Nokel and N. Loukachevitch. “Accounting N-grams and Multi-word Terms can Improve Topic Models”. In: Proceedings of the 12th Workshop on Multiword Expressions. 2016, pp. 44–49.
[30] P. Olivelle. The Early Upaniṣads. Annotated Text and Translation.
Oxford: Oxford University Press, 1998.
[31] Y. Ouvrard and P. Verkerk. “Collatinus & Eulexis: Latin & Greek Dictionaries in the Digital Ages”. In: Digital Classics III: Re-thinking Text Analysis. Center for Hellenic Studies/Harvard University, 2017.
[32] M. Pagel, Q. D. Atkinson, and A. Meade. “Frequency of Word-use Predicts Rates of Lexical Evolution throughout Indo-European History”. In: Nature 449.7163 (2007), pp. 717–720.
[33] V. Perrone, M. Palma, S. Hengchen, A. Vatri, J. Q. Smith, and B. McGillivray. “GASC: Genre-Aware Semantic Change for Ancient Greek”. In: Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change. 2019, pp. 56–66.
[34] M. J. Roberts. The Hexameter Paraphrase in Late Antiquity: Origins and Applications to Biblical Texts. Urbana-Champaign: University of Illinois, 1978.
[35] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. “The Author-topic Model for Authors and Documents”. In: Proceedings of the 20th Conference on Uncertainty in AI. 2004, pp. 487–494.
[36] J. B. Solodow. Latin Alive. The Survival of Latin in English and the Romance Languages. Cambridge: Cambridge University Press, 2009.
[37] A. Stefenelli. “Lexical Stability”. In: The Cambridge History of the Romance Languages. Volume I: Structures. Ed. by M. Maiden, J. C. Smith, and A. Ledgeway. Cambridge: Cambridge University Press, 2011, pp. 564–584.
[38] C. R. Stone. “What Plagiarism was not: Some Preliminary Observations on Classical Chinese Attitudes Toward What the West Calls Intellectual Property”. In: Marquette Law Review 92 (2008), p. 199.
[39] G. Trompf. “The Concept of the Carolingian Renaissance”. In: Journal of the History of Ideas 34.1 (1973), pp. 3–26.
[40] F. Tutrone. “Lucretius Franco-Hibernicus: Dicuil’s Liber de Astronomia and the Carolingian Reception of De Rerum Natura”. In: Illinois Classical Studies 45.1 (2020), pp. 224–252.
[41] N. Vincent. “Continuity and Change from Latin to Romance”.
In: Early and Late Latin. Continuity or Change? Ed. by J. Adams and N. Vincent. Cambridge: Cambridge University Press, 2016, pp. 1–13.
[42] J. Wackernagel. Altindische Grammatik. I. Lautlehre. Göttingen: Vandenhoeck und Ruprecht, 1896.
[43] H. M. Wallach. “Topic Modeling: Beyond Bag-of-words”. In: Proceedings of the 23rd ICML. 2006, pp. 977–984.
[44] C. Wang, D. Blei, and D. Heckerman. “Continuous Time Dynamic Topic Models”. In: Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence. 2008, pp. 579–586.
[45] X. Wang and A. McCallum. “Topics over Time: A Non-Markov Continuous-time Model of Topical Trends”. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006, pp. 424–433.
[46] X. Wang, A. McCallum, and X. Wei. “Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval”. In: Proceedings of the Seventh ICDM. 2007, pp. 697–702.
[47] D. Whitelock. After Bede. Newcastle: Bealls, 1978.
[48] W. Wu and D. Yarowsky. “Computational Etymology and Word Emergence”. In: Proceedings of the 12th Language Resources and Evaluation Conference. 2020, pp. 3252–3259.
[49] L. Yao, D. Mimno, and A. McCallum. “Efficient Methods for Topic Model Inference on Streaming Document Collections”. In: Proceedings of the 15th ACM SIGKDD. 2009, pp. 937–946.