Vir is to Moderatus as Mulier is to Intemperans: Lemma Embeddings for Latin

Rachele Sprugnoli, Marco Passarotti, Giovanni Moretti
CIRCSE Research Centre, Università Cattolica del Sacro Cuore
Largo Agostino Gemelli 1, 20123 Milano
{rachele.sprugnoli,marco.passarotti,giovanni.moretti}@unicatt.it

Abstract

English. This paper presents a new set of lemma embeddings for the Latin language. Embeddings are trained on a manually annotated corpus of texts belonging to the Classical era: different models, architectures and dimensions are tested and evaluated using a novel benchmark for the synonym selection task. A qualitative evaluation is also performed on the embeddings of rare lemmas. In addition, we release vectors pre-trained on the "Opera Maiora" by Thomas Aquinas, thus providing a resource to analyze Latin in a diachronic perspective.

[Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).]

1 Introduction

Any study of the ancient world is inextricably bound to empirical sources, be those archaeological relics, artifacts or texts. Most ancient texts are written in dead languages, one of the distinguishing features of which is that both their lexicon and their textual evidence are essentially closed, without any substantial new additions. This finite nature of dead languages, together with the need for empirical data in their study, makes the preservation and careful analysis of their legacy a core task of the (scientific) community. Although computational and corpus linguistics have mainly focused on building tools and resources for modern languages, there has always been large interest in providing scholars with collections of texts written in dead or historical languages (Berti, 2019). Not by chance, one of the first electronic corpora ever produced is the "Index Thomisticus" (Busa, 1974-1980), the opera omnia of Thomas Aquinas written in Latin in the 13th century. Owing to its wide diachronic span covering more than two millennia, as well as its diatopic distribution across Europe and the Mediterranean, Latin is the most resourced historical language with respect to the availability of textual corpora. Large collections of Latin texts, e.g. the Perseus Digital Library (http://www.perseus.tufts.edu/hopper/) and the corpus of Medieval Italian Latinity ALIM (http://www.alim.dfll.univr.it/), can now be processed with state-of-the-art computational tools and methods to provide linguistic resources that enable scholars to exploit the empirical evidence provided by such datasets to the fullest. This is particularly promising given that the quality of many textual resources for Latin, carefully built over decades, is high.

Recent years have seen the rise of language modeling and feature learning techniques applied to linguistic data, resulting in so-called "word embeddings", i.e. empirically trained vectors of lexical items in which words occurring in similar linguistic contexts are assigned nearby positions in the vector space. The semantic meaningfulness and motivation of word embeddings stem from the basic assumption of distributional semantics, according to which the distributional properties of words mirror their semantic similarities and/or differences, so that words sharing similar contexts tend to have similar meanings.

In this paper, we present and evaluate a number of embeddings for Latin built from a manually lemmatized dataset containing texts from the Classical era (word embeddings built on tokens of the same dataset are also available online). In addition, we release embeddings trained on a manually lemmatized corpus of medieval texts to facilitate diachronic analyses.
This research is performed in the context of the LiLa: Linking Latin project (https://lila-erc.eu/), which seeks to build a Knowledge Base of linguistic resources for Latin connected via a common vocabulary of knowledge description, following the principles of the Linked Data framework. Our contribution provides the community with new resources to be connected in the LiLa Knowledge Base, aimed at supporting data-driven socio-cultural studies of the Latin world. The added value of our lemma embeddings for Latin results from the interdisciplinary blending of state-of-the-art methods in computational linguistics with the long tradition of Latin corpora creation: on the one hand, the embeddings are evaluated with techniques hitherto applied to modern language data only; on the other, they are built from high quality datasets heavily used by scholars working on Latin.

2 Related Work

Word embeddings are crucial to many Natural Language Processing (NLP) tasks (Collobert et al., 2011; Lample et al., 2016; Yu et al., 2017). Numerous pre-trained word vectors generated with different algorithms have been released, typically trained on huge amounts of contemporary texts written in modern languages. Interest in this type of distributional approach has also emerged in the Digital Humanities, as evidenced by publications on the use of word embeddings trained on literary texts or historical documents (Hamilton et al., 2016; Leavy et al., 2018; Sprugnoli and Tonelli, 2019). Although to a lesser extent, the literature also reports works on word embeddings for dead languages, including Latin.

Both Facebook and the organizers of the CoNLL shared tasks on multilingual parsing have pre-computed and released word embeddings trained on Latin texts crawled from the web: the former using the fastText model on Common Crawl and Wikipedia dumps (Grave et al., 2018a), the latter applying word2vec to Common Crawl only (Zeman et al., 2018). Both resources were developed by relying on automatic language detection engines: they are very big in terms of vocabulary size (for example, the CoNLL embeddings vocabulary contains 1,082,365 words) but highly noisy due to the presence of languages other than Latin. In addition, they include terms related to modern times, such as movie stars, TV series and companies (e.g., Cumberbatch, Simpson, Google), making them unsuitable for the study of language use in ancient texts. Automatic language detection has also been employed by Bamman and Smith (2012) to collect a corpus of Latin books available from the Internet Archive. The corpus spans from 200 BCE to the 20th century and contains 1.38 billion tokens: embeddings trained on this corpus (http://www.cs.cmu.edu/~dbamman/latin.html) were used to investigate the relationship between concepts and historical characters in the work of Cassiodorus (Bjerva and Praet, 2015). However, these word vectors are affected by OCR errors present in the training corpus: 25% of the embedding vocabulary contains non-alphanumeric characters, e.g. -**-, iftudˆ. The quality of the corpus used to train the Latin word embeddings available through the SemioGraph interface (http://semiograph.texttechnologylab.org/), on the other hand, is high: these embeddings are based on the "Computational Historical Semantics" database, a manually curated collection of 4,000 Latin texts written between the 2nd and the 15th century AD (Jussen and Rohmann, 2015). In SemioGraph, more than one hundred word vectors can be visually explored by searching Part-of-Speech (PoS) labels and text genres: however, these vectors cannot be downloaded for further analysis and were generated with one model only, i.e. word2vec.

With respect to the works cited above, in this paper we rely on manually lemmatized texts free of OCR errors, we focus on a period not covered by the "Computational Historical Semantics" database, and we test two models to learn lemma representations.
It is worth noting that none of the previously mentioned studies has carried out an evaluation of the trained Latin embeddings; we, on the contrary, provide both quantitative and qualitative evaluations of our vectors.

3 Dataset Description

Our lemma vectors were trained on the "Opera Latina" corpus (Denooz, 2004). This textual resource has been collected and manually annotated since 1961 by the Laboratoire d'Analyse Statistique des Langues Anciennes (LASLA) at the University of Liège (http://web.philo.ulg.ac.be/lasla/textes-latins-traites/). It includes 158 texts from 20 different Classical authors covering various genres, such as treatises (e.g. "Annales" by Tacitus), letters (e.g. "Epistulae" by Pliny the Younger), epic poems (e.g. "Aeneis" by Virgil), elegies (e.g. "Elegiae" by Propertius), plays (both comedies and tragedies, e.g. "Aulularia" by Plautus and "Oedipus" by Seneca), and public speeches (e.g. "Philippicae" by Cicero). The corpus can be queried through an online interface after requesting credentials: http://cipl93.philo.ulg.ac.be/OperaLatina/

The corpus contains several layers of linguistic annotation, such as lemmatization, PoS tagging and tagging of inflectional features, organized in space-separated files. "Opera Latina" contains approximately 1,700,000 words (punctuation is not present in the corpus), corresponding to 133,886 unique tokens and 24,339 unique lemmas.

4 Experimental Setup

We tested two different vector representations, namely word2vec (Mikolov et al., 2013a) and fastText (Bojanowski et al., 2017): the former is based on linear bag-of-words contexts, generating a distinct vector for each word, whereas the latter is based on a bag of character n-grams, that is, the vector for a word (or a lemma) is the sum of its character n-gram vectors. Lemma vectors were pre-computed using two dimensionalities (100, 300) and two architectures: skip-gram and Continuous Bag-of-Words (CBOW). In this way, we had the possibility of evaluating both modest and high dimensional vectors and two architectures: skip-gram is designed to predict the context given a target word, whereas CBOW predicts the target word based on the context. The window size was 10 lemmas for skip-gram and 5 for CBOW. The other training options were the same for the two models:

• number of negatives sampled: 25;
• number of threads: 20;
• number of iterations over the corpus: 15;
• minimal number of word occurrences: 5.

Embeddings were trained on the lemmatized "Opera Latina" in order to reduce the data sparsity due to the highly inflectional nature of Latin. Moreover, we lower-cased the text and converted v into u (so that vir 'man' becomes uir) to fit the lexicographic conventions of some Latin dictionaries (Glare, 1982) and corpora. With the minimal number of lemma occurrences set to 5, we obtained a vocabulary size of 11,327 lemmas.
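As a concrete illustration of this setup, the sketch below reproduces the configuration above with the gensim library; gensim itself and the corpus file name are assumptions, since the paper does not state which implementation was used for training.

# A minimal sketch of the training setup described above, using gensim.
# Assumption: one line of lower-cased, v->u normalized lemmas per sentence
# in a plain-text file extracted from the annotated "Opera Latina" files.
from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("opera_latina_lemmas.txt")  # hypothetical path

for dim in (100, 300):
    # sg=1 is skip-gram (window 10); sg=0 is CBOW (window 5).
    for sg, window in ((1, 10), (0, 5)):
        for cls, name in ((Word2Vec, "word2vec"), (FastText, "fasttext")):
            model = cls(corpus,
                        vector_size=dim,  # 100- or 300-dimensional vectors
                        sg=sg,
                        window=window,
                        negative=25,      # number of negatives sampled
                        workers=20,       # number of threads
                        epochs=15,        # iterations over the corpus
                        min_count=5)      # minimal number of occurrences
            arch = "sg" if sg else "cbow"
            model.wv.save(f"{name}_{arch}_{dim}.kv")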
5 Evaluation

Word embeddings resulting from the experiments described in the previous Section were tested by performing both an intrinsic and a qualitative evaluation (Schnabel et al., 2015). To the best of our knowledge, these methods, although well documented in the literature, have never been applied to the evaluation of Latin embeddings.

5.1 Synonym Selection Task

In the synonym selection task, the goal is to select the correct synonym of a target lemma out of a set of possible answers (Baroni et al., 2014). The most commonly used benchmark for this task is the Test of English as a Foreign Language (TOEFL), consisting of multiple-choice questions each involving five terms: the target word and another four, one of which is a synonym of the target word and the remaining three decoys (Landauer and Dumais, 1997). The original TOEFL dataset is made of only 80 questions, but extensions have been proposed to widen the set of multiple-choice questions using external resources such as WordNet (Ehlert, 2003; Freitag et al., 2005).

In order to create a TOEFL-like benchmark for Latin, we relied on four digitized dictionaries of Latin synonyms (Hill, 1804; Dumesnil, 1819; Von Doederlein and Taylor, 1875; Skřivan, 1890) available online in XML Dictionary eXchange format (https://github.com/nikita-moor/latin-dictionary). Starting from the digital versions of the dictionaries, we proceeded as follows:

• we downloaded and parsed the XML files so as to extract only the information useful for our purposes, that is, the dictionary entry and the synonyms;
• we merged the content of all dictionaries to obtain the largest possible list of lemmas with their corresponding synonyms. Unlike "Opera Latina" and the other synonym dictionaries, Dumesnil (1819) often lemmatizes verbs under the infinitive form; therefore, for the sake of uniformity, we used LEMLAT v3 (https://github.com/CIRCSE/LEMLAT3) to obtain the first person, singular, present, active (or passive, in case of deponent verbs), indicative form of all verbs registered in that dictionary in their present infinitive form (e.g. accingere 'to gird on' → accingo) (Passarotti et al., 2017). At the end of this phase, we obtained a new resource containing 2,759 unique entries, covering all types of PoS, together with their synonyms;
• multiple-choice questions were created by taking each entry as a target lemma, then adding its first synonym and another three lemmas randomly chosen from the "Opera Latina" corpus (see the sketch after Table 1);
• a Latin language expert manually checked samples of the multiple-choice questions so as to be sure that the three randomly chosen lemmas were in fact decoy lemmas.

Table 1 provides some examples of the multiple-choice questions generated using the procedure described above.

TARGET WORD | SYNONYM | DECOY WORDS
decretum/decree | edictum/proclamation | flagitium/shameful act, adolesco/to grow up, stipendiarius/tributary
saepe/often | crebro/frequently | conquiro/to seek for, ululatus/howling, frugifer/fertile
rogo/to ask | oro/to ask for | columna/column, retorqueo/to twist back, errabundus/vagrant
exilis/thin | macer/emaciated | moles/pile, mortalitas/mortality, audens/daring

Table 1: Examples taken from the Latin benchmark for the synonym selection task.
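The question-generation step in the procedure above can be sketched as follows; the input lists are hypothetical miniatures standing in for the parsed dictionary data and the full "Opera Latina" lemma inventory.

# Sketch of the multiple-choice question generation described above.
import random

def make_question(target, synonym, corpus_lemmas, n_decoys=3):
    # Randomly pick decoy lemmas, excluding the target and its synonym;
    # samples of the output are then checked by a Latin language expert.
    pool = [l for l in corpus_lemmas if l not in (target, synonym)]
    return {"target": target, "synonym": synonym,
            "decoys": random.sample(pool, n_decoys)}

# Hypothetical miniatures: pairs parsed from the four dictionaries and a
# few lemmas from "Opera Latina".
synonym_pairs = [("decretum", "edictum"), ("saepe", "crebro")]
corpus_lemmas = ["flagitium", "adolesco", "stipendiarius", "conquiro",
                 "ululatus", "frugifer", "columna", "retorqueo"]

questions = [make_question(t, s, corpus_lemmas) for t, s in synonym_pairs]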
We computed the performance of the embeddings by calculating the cosine similarity between the vector of the target lemma and that of each of the other lemmas, picking the candidate with the largest cosine. Questions containing lemmas not included in the vocabulary, and thus vectorless, are automatically filtered out; results are given in terms of accuracy. As shown in Table 2, fastText proved to be the best lemma representation for the synonym selection task, with the skip-gram architecture achieving an accuracy above 86%. This result can be explained by the fact that fastText is able to model morphology by taking into consideration sub-word units (i.e. character n-grams), joining lemmas from the same derivational families. In addition, the skip-gram architecture works well with small amounts of training data like ours. It is also worth noting that, for both architectures and models, vectors with a modest dimensionality achieved a slightly higher accuracy with respect to embeddings with 300 dimensions.

Dimensions | word2vec cbow | word2vec skip-gram | fastText cbow | fastText skip-gram
100 | 81.14% | 79.83% | 80.57% | 86.91%
300 | 80.86% | 79.48% | 79.43% | 86.40%

Table 2: Results of the synonym selection task calculated on the whole benchmark.

The error analysis revealed specific types of linguistic and semantic relations, other than synonymy, holding between the target lemma and the decoy lemma that turned out to have the largest cosine: for example, meronymy (e.g. target word: annalis 'chronicles' - synonym: historia 'narrative of past events' - answer: charta 'paper') and morphological derivation (e.g. target word: consors 'having a common lot' - synonym: particeps 'sharer' - answer: sors 'lot').
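The selection rule just described (largest cosine with the target, vectorless questions filtered out) amounts to a few lines of code; a minimal sketch, assuming vectors saved in gensim's KeyedVectors format under hypothetical file names:

# Sketch of the synonym selection evaluation described above.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("fasttext_sg_100.kv")  # hypothetical file name

# One benchmark question in the format produced by the generation step.
question = {"target": "saepe", "synonym": "crebro",
            "decoys": ["conquiro", "ululatus", "frugifer"]}

def answer(q, wv):
    lemmas = [q["target"], q["synonym"]] + q["decoys"]
    # Questions containing vectorless lemmas are filtered out.
    if any(l not in wv.key_to_index for l in lemmas):
        return None
    candidates = [q["synonym"]] + q["decoys"]
    # Pick the candidate with the largest cosine similarity to the target.
    return max(candidates, key=lambda c: wv.similarity(q["target"], c))

answers = [(q, answer(q, wv)) for q in [question]]
answers = [(q, a) for q, a in answers if a is not None]
accuracy = sum(a == q["synonym"] for q, a in answers) / len(answers)
print(f"accuracy: {accuracy:.2%}")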
As an additional analysis, we repeated our evaluation on a subset of the benchmark containing 85 questions made of lemmas sharing the same PoS, e.g. auxilior 'to assist', adiuuo 'to help', censeo 'to assess', reuerto 'to turn back', humo 'to bury'. Results reported in Table 3 confirm that the skip-gram architecture provides the best accuracy for this task, achieving a score above 90% for fastText embeddings with 300 dimensions. We also note an improvement of the accuracy for word2vec (+5%). The reasons behind these results need further investigation.

Dimensions | word2vec cbow | word2vec skip-gram | fastText cbow | fastText skip-gram
100 | 81.48% | 85.18% | 77.77% | 87.03%
300 | 76.63% | 85.18% | 75.92% | 90.74%

Table 3: Results of the synonym selection task calculated on a subset of the benchmark containing only questions with lemmas sharing the same PoS.

5.2 Qualitative Evaluation on Rare Lemma Embeddings

One of the main differences between word2vec and fastText is that the latter is supposed to be able to generate better embeddings for words that occur rarely in the training data. This is due to the fact that rare words in word2vec have few neighboring context words from which to learn the vector representation, whereas in fastText even rare words share their character n-grams with other words, making it possible to represent them reliably. To validate this hypothesis, we performed a qualitative evaluation of the nearest neighbors of a small set of randomly selected lemmas appearing only between 5 and 10 times in the "Opera Latina" corpus. Two Latin language experts manually checked the two most similar lemmas (in terms of cosine similarity) induced by the different 100-dimension embeddings we trained. Table 4 presents a sample of the selected rare lemmas and their neighbors: an asterisk marks neighbors that the two experts judged as most semantically related to the target lemma.

MODEL | contrudo/to thrust | frugaliter/thriftily | auspicatus/consecrated by auspices
fastText-skip | protrudo*/to thrust forward, extrudo*/to thrust out | frugalis*/thrifty, frugalitas*/economy | auspicato*/after taking the auspices, auspicium*/auspices
fastText-cbow | contego*/to cover, contraho/to collect | aliter/differently, negligenter/neglectfully | auguratus*/the office of augur, pontificatus/the office of pontifex
word2vec-skip | infodio/to bury, tabeo/to melt away | frugi*/frugal, quaerito/to seek earnestly | erycinus/Erycinian, parilia/the feast of Pales
word2vec-cbow | refundo/to pour back, infodio/to bury | lautus/neat, frugi*/frugal | erycinus/Erycinian, parilia/the feast of Pales

Table 4: Examples of the nearest neighbors of rare lemmas.

This manual inspection, even if based on a small set of data, shows that the embeddings trained using the fastText model with the skip-gram architecture can find more similar lemmas than those trained with other models and architectures.
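Neighbor lists like those in Table 4 can be extracted with a short query loop; a sketch, under the same hypothetical file-name assumptions as above:

# Sketch: inspect the two nearest neighbors of each rare lemma per model.
from gensim.models import KeyedVectors

# Rare lemmas occurring between 5 and 10 times in "Opera Latina" (Table 4).
rare_lemmas = ["contrudo", "frugaliter", "auspicatus"]

for name in ("fasttext_sg_100", "fasttext_cbow_100",
             "word2vec_sg_100", "word2vec_cbow_100"):
    wv = KeyedVectors.load(name + ".kv")  # hypothetical file names
    for lemma in rare_lemmas:
        # Two most cosine-similar lemmas, as inspected by the two experts.
        print(name, lemma, wv.most_similar(lemma, topn=2))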
6 A Diachronic Perspective

Diachronic analyses are particularly relevant for Latin, given that its use spans more than two millennia. To support this type of study we release, together with the embeddings presented in the previous Sections, lemma vectors trained on the "Opera Maiora", written by Thomas Aquinas in the 13th century. "Opera Maiora" is a set of philosophical and religious works comprising some 4.5 million words (Passarotti, 2015): all texts are manually lemmatized and tagged at the morphological level (Passarotti, 2010) and are part of the "Index Thomisticus" (IT) corpus.

Before training the embeddings, we pre-processed the texts following the conventions adopted in "Opera Latina": we lower-cased, removed punctuation, and converted v and j into u and i, respectively. Embeddings were trained with the configuration that reported the best results in the evaluation described in Section 5 (i.e. fastText with the skip-gram architecture and 100 dimensions). For a comparative analysis with the embeddings of "Opera Latina", we aligned the embeddings of "Opera Maiora" to the same coordinate axes using the unsupervised alignment algorithm provided with the fastText code (Grave et al., 2018b). Thanks to this alignment, we can inspect the nearest neighbors (nn) of lemmas in the two embedding spaces. For example, the lemma ordo shifts from social class or military rank (among the top 10 nn in the "Opera Latina" embeddings we find, in this order, equester 'cavalry', legionarius 'legionary', turmatim 'by squadrons') to referring to the concept of order and intellectual structure in Thomas Aquinas (nn in "Opera Maiora": ordinatio 'setting in order', coordinatio 'arranging together', ordino 'set in order') (Busa, 1977). Another interesting case is spiritus: in the Classical era it refers to 'breath' (nn in "Opera Latina": spiro 'to blow', exspiro 'to exhale', spiramentum 'draught'), while in Aquinas' Christian writings it is associated with the Holy Ghost (nn: sanctio 'to make sacred', donum 'gift', paracletus 'protector') (Busa, 1983).
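Once the two spaces share coordinate axes, such comparisons reduce to neighbor queries and cross-period cosine distances; a sketch, assuming the alignment has already been performed with the fastText alignment code and the resulting vectors saved under hypothetical file names:

# Sketch: compare nearest neighbors of the same lemma in the Classical
# ("Opera Latina") and medieval ("Opera Maiora") embedding spaces.
import numpy as np
from gensim.models import KeyedVectors

classical = KeyedVectors.load("opera_latina_ft_sg_100.kv")         # hypothetical
medieval = KeyedVectors.load("opera_maiora_aligned_ft_sg_100.kv")  # hypothetical

for lemma in ("ordo", "spiritus"):
    print(lemma)
    print("  Opera Latina nn:",
          [w for w, _ in classical.most_similar(lemma, topn=10)])
    print("  Opera Maiora nn:",
          [w for w, _ in medieval.most_similar(lemma, topn=10)])
    # Because the axes are shared, the shift of a lemma can also be
    # quantified directly as the cosine distance between its two vectors.
    u, v = classical[lemma], medieval[lemma]
    dist = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(f"  cross-period cosine distance: {dist:.3f}")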
7 Conclusion and Future Work

In this paper we presented a new set of Latin embeddings based on high quality lemmatized corpora, together with a new benchmark for the synonym selection task. The aligned embeddings can be visually explored through a web interface, and all the resources are freely available online: https://embeddings.lila-erc.eu.

Several directions for future work are envisaged. For example, we plan to develop new benchmarks, such as the analogy test (Mikolov et al., 2013b) or the rare words dataset (Luong et al., 2013), for the intrinsic quantitative evaluation of Latin embeddings. Moreover, embeddings could be used to improve the linking of datasets in the LiLa Knowledge Base. We would also like to extend the diachronic analysis to the embeddings trained on the "Computational Historical Semantics" database as soon as these become available.

This work represents the first step towards the development of a new set of resources for the analysis of Latin. This effort is laying the foundations of the first campaign devoted to the evaluation of NLP tools for Latin, EvaLatin.

Acknowledgments

This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme via the "LiLa: Linking Latin" project - Grant Agreement No. 769994. The authors also wish to thank Andrea Peverelli for his expert support on Latin and Chris Culy for providing his code for the embeddings visualization.

References

David Bamman and David Smith. 2012. Extracting two thousand years of Latin from a million book library. Journal on Computing and Cultural Heritage (JOCCH), 5(1):1-13.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238-247.

Monica Berti. 2019. Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution, volume 10. Walter de Gruyter GmbH & Co KG.

Johannes Bjerva and Raf Praet. 2015. Word embeddings pointing the way for late antiquity. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 53-57.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Roberto Busa. 1974-1980. Index Thomisticus: sancti Thomae Aquinatis operum omnium indices et concordantiae, in quibus verborum omnium et singulorum formae et lemmata cum suis frequentiis et contextibus variis modis referuntur quaeque / consociata plurium opera atque electronico IBM automato usus digessit Robertus Busa SJ. Frommann-Holzboog.

Roberto Busa. 1977. Ordo dans les oeuvres de St. Thomas d'Aquin. II Coll. Intern. del Lessico Intellettuale Europeo, pages 59-184.

Roberto Busa. 1983. De voce spiritus in operibus S. Thomae Aquinatis. IV Coll. Intern. del Lessico Intellettuale Europeo, pages 191-222.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537.

Joseph Denooz. 2004. Opera latina: une base de données sur internet. Euphrosyne, 32:79-88.

Jean Baptiste Gardin Dumesnil. 1819. Latin Synonyms: With Their Different Significations: and Examples Taken from the Best Latin Authors. GB Whittaker.

Bret R. Ehlert. 2003. Making accurate lexical semantic similarity judgments using word-context co-occurrence statistics. University of California, San Diego.

Dayne Freitag, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, and Zhiqiang Wang. 2005. New experiments in distributional representations of synonymy. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 25-32. Association for Computational Linguistics.

Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford Univ. Press.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018a. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3843-3847, Miyazaki, Japan. European Language Resources Association (ELRA).

Edouard Grave, Armand Joulin, and Quentin Berthet. 2018b. Unsupervised alignment of embeddings with Wasserstein Procrustes, pages 1880-1890.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489-1501, Berlin, Germany. Association for Computational Linguistics.

John Hill. 1804. The Synonymes in the Latin Language, Alphabetically Arranged; with Critical Dissertations Upon the Force of Its Prepositions, Both in a Simple and Compounded State. James Ballantyne, for Longman and Rees, London.

Bernhard Jussen and Gregor Rohmann. 2015. Historical semantics in medieval studies: New means and approaches. Contributions to the History of Concepts, 10(2):1-6.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260-270.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.

Susan Leavy, Karen Wade, Gerardine Meaney, and Derek Greene. 2018. Navigating literary text with word embeddings and semantic lexicons. In Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018), Lausanne, Switzerland, 4-5 June 2018.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104-113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Marco Passarotti. 2010. Leaving behind the less-resourced status. The case of Latin through the experience of the Index Thomisticus Treebank. In 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, Valletta, Malta, 23 May 2010, pages 27-32.

Marco Passarotti. 2015. What you can do with linguistically annotated data. From the Index Thomisticus to the Index Thomisticus Treebank. In Reading Sacred Scripture with Thomas Aquinas: Hermeneutical Tools, Theological Questions and New Perspectives, pages 3-44.

Marco Passarotti, Marco Budassi, Eleonora Litta, and Paolo Ruffolo. 2017. The Lemlat 3.0 package for morphological analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, number 133, pages 24-31. Linköping University Electronic Press.

Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298-307.

Arnošt Skřivan. 1890. Latinská synonymika pro školu i dům. V Chrudimi.

Rachele Sprugnoli and Sara Tonelli. 2019. Novel event detection and classification for historical texts. Computational Linguistics, 45(2):229-265.

Ludwig Von Doederlein and Samuel Harvey Taylor. 1875. Döderlein's Hand-book of Latin Synonymes. WF Draper.

Liang-Chih Yu, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534-539.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-21.