Vir is to Moderatus as Mulier is to Intemperans: Lemma Embeddings for Latin

Rachele Sprugnoli, Marco Passarotti, Giovanni Moretti
CIRCSE Research Centre, Università Cattolica del Sacro Cuore
Largo Agostino Gemelli 1, 20123 Milano
{rachele.sprugnoli,marco.passarotti,giovanni.moretti}@unicatt.it

Abstract

English. This paper presents a new set of lemma embeddings for the Latin language. Embeddings are trained on a manually annotated corpus of texts belonging to the Classical era: different models, architectures and dimensions are tested and evaluated using a novel benchmark for the synonym selection task. A qualitative evaluation is also performed on the embeddings of rare lemmas. In addition, we release vectors pre-trained on the "Opera Maiora" by Thomas Aquinas, thus providing a resource to analyze Latin in a diachronic perspective.

[Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).]

1 Introduction

Any study of the ancient world is inextricably bound to empirical sources, be those archaeological relics, artifacts or texts. Most ancient texts are written in dead languages, one of the distinguishing features of which is that both their lexicon and their textual evidence are essentially closed, without any substantial new additions. This finite nature of dead languages, together with the need for empirical data in their study, makes the preservation and careful analysis of their legacy a core task of the (scientific) community. Although computational and corpus linguistics have mainly focused on building tools and resources for modern languages, there has always been large interest in providing scholars with collections of texts written in dead or historical languages (Berti, 2019). Not by chance, one of the first electronic corpora ever produced is the "Index Thomisticus" (Busa, 1974-1980), the opera omnia of Thomas Aquinas written in Latin in the 13th century. Owing to its wide diachronic span covering more than two millennia, as well as its diatopic distribution across Europe and the Mediterranean, Latin is the most resourced historical language with respect to the availability of textual corpora. Large collections of Latin texts, e.g. the Perseus Digital Library (http://www.perseus.tufts.edu/hopper/) and the corpus of Medieval Italian Latinity ALIM (http://www.alim.dfll.univr.it/), can now be processed with state-of-the-art computational tools and methods to provide linguistic resources that enable scholars to exploit the empirical evidence provided by such datasets to the fullest. This is particularly promising given that the quality of many textual resources for Latin, carefully built over decades, is high.

Recent years have seen the rise of language modeling and feature learning techniques applied to linguistic data, resulting in so-called "word embeddings", i.e. empirically trained vectors of lexical items in which words occurring in similar linguistic contexts are assigned nearby positions in the vector space. The semantic meaningfulness and motivation of word embeddings stem from the basic assumption of distributional semantics, according to which the distributional properties of words mirror their semantic similarities and/or differences, so that words sharing similar contexts tend to have similar meanings.

In this paper, we present and evaluate a number of embeddings for Latin built from a manually lemmatized dataset containing texts from the Classical era (word embeddings built on tokens of the same dataset are also available online). In addition, we release embeddings trained on a manually lemmatized corpus of medieval texts to facilitate diachronic analyses.
This research is performed in the context of the LiLa: Linking Latin project (https://lila-erc.eu/), which seeks to build a Knowledge Base of linguistic resources for Latin connected via a common vocabulary of knowledge description, following the principles of the Linked Data framework. Our contribution provides the community with new resources to be connected in the LiLa Knowledge Base, aimed at supporting data-driven socio-cultural studies of the Latin world. The added value of our lemma embeddings for Latin results from the interdisciplinary blending of state-of-the-art methods in computational linguistics with the long tradition of Latin corpora creation: on the one hand, the embeddings are evaluated with techniques hitherto applied to modern language data only; on the other, they are built from high quality datasets heavily used by scholars working on Latin.

2 Related Work

Word embeddings are crucial to many Natural Language Processing (NLP) tasks (Collobert et al., 2011; Lample et al., 2016; Yu et al., 2017). Numerous pre-trained word vectors generated with different algorithms have been released, typically trained on huge amounts of contemporary texts written in modern languages. Interest in this type of distributional approach has also emerged in the Digital Humanities, as evidenced by publications on the use of word embeddings trained on literary texts or historical documents (Hamilton et al., 2016; Leavy et al., 2018; Sprugnoli and Tonelli, 2019). Although to a lesser extent, the literature also reports works on word embeddings for dead languages, including Latin.

Both Facebook and the organizers of the CoNLL shared tasks on multilingual parsing have pre-computed and released word embeddings trained on Latin texts crawled from the web: the former using the fastText model on Common Crawl and Wikipedia dumps (Grave et al., 2018a), the latter applying word2vec to Common Crawl only (Zeman et al., 2018). Both resources were developed by relying on automatic language detection engines: they are very big in terms of vocabulary size (for example, the CoNLL embeddings vocabulary contains 1,082,365 words) but highly noisy due to the presence of languages other than Latin. In addition, they include terms related to modern times, such as movie stars, TV series and companies (e.g., Cumberbatch, Simpson, Google), making them unsuitable for the study of language use in ancient texts. Automatic language detection has also been employed by Bamman and Smith (2012) to collect a corpus of Latin books available from the Internet Archive. The corpus spans from 200 BCE to the 20th century and contains 1.38 billion tokens: embeddings trained on this corpus (http://www.cs.cmu.edu/~dbamman/latin.html) were used to investigate the relationship between concepts and historical characters in the work of Cassiodorus (Bjerva and Praet, 2015). However, these word vectors are affected by OCR errors present in the training corpus: 25% of the embedding vocabulary contains non-alphanumeric characters, e.g. -**-, iftudˆ. The quality of the corpus used to train the Latin word embeddings available through the SemioGraph interface (http://semiograph.texttechnologylab.org/), on the other hand, is high: these embeddings are based on the "Computational Historical Semantics" database, a manually curated collection of 4,000 Latin texts written between the 2nd and the 15th century AD (Jussen and Rohmann, 2015). In SemioGraph, more than one hundred word vectors can be visually explored by searching Part-of-Speech (PoS) labels and text genres: however, these vectors cannot be downloaded for further analysis and were generated with one model only, i.e. word2vec.

With respect to the works cited above, in this paper we rely on manually lemmatized texts free of OCR errors, we focus on a period not covered by the "Computational Historical Semantics" database, and we test two models to learn lemma representations.
It is worth noting that none of the previously mentioned studies has carried out an evaluation of the trained Latin embeddings; we, on the contrary, provide both quantitative and qualitative evaluations of our vectors.

3 Dataset Description

Our lemma vectors were trained on the "Opera Latina" corpus (Denooz, 2004). This textual resource has been collected and manually annotated since 1961 by the Laboratoire d'Analyse Statistique des Langues Anciennes (LASLA) at the University of Liège (http://web.philo.ulg.ac.be/lasla/textes-latins-traites/). It includes 158 texts from 20 different Classical authors covering various genres, such as treatises (e.g. "Annales" by Tacitus), letters (e.g. "Epistulae" by Pliny the Younger), epic poems (e.g. "Aeneis" by Virgil), elegies (e.g. "Elegiae" by Propertius), plays (both comedies and tragedies, e.g. "Aulularia" by Plautus and "Oedipus" by Seneca), and public speeches (e.g. "Philippicae" by Cicero). The corpus can be queried through an online interface after requesting credentials: http://cipl93.philo.ulg.ac.be/OperaLatina/

The corpus contains several layers of linguistic annotation, such as lemmatization, PoS tagging and tagging of inflectional features, organized in space-separated files. "Opera Latina" contains approximately 1,700,000 words (punctuation is not present in the corpus), corresponding to 133,886 unique tokens and 24,339 unique lemmas.

4 Experimental Setup

We tested two different vector representations, namely word2vec (Mikolov et al., 2013a) and fastText (Bojanowski et al., 2017): the former is based on linear bag-of-words contexts, generating a distinct vector for each word, whereas the latter is based on a bag of character n-grams, that is, the vector for a word (or a lemma) is the sum of its character n-gram vectors. Lemma vectors were pre-computed using two dimensionalities (100, 300) and two architectures: skip-gram and Continuous Bag-of-Words (CBOW). In this way, we had the possibility of evaluating both modest and high dimensional vectors and two architectures: skip-gram is designed to predict the context given a target word, whereas CBOW predicts the target word based on the context. The window size was 10 lemmas for skip-gram and 5 for CBOW. The other training options were the same for the two models:

• number of negatives sampled: 25;
• number of threads: 20;
• number of iterations over the corpus: 15;
• minimal number of word occurrences: 5.

Embeddings were trained on the lemmatized "Opera Latina" in order to reduce the data sparsity due to the highly inflectional nature of Latin. Moreover, we lower-cased the text and converted v into u (so that vir 'man' becomes uir) to fit the lexicographic conventions of some Latin dictionaries (Glare, 1982) and corpora. With the minimal number of lemma occurrences set to 5, we obtained a vocabulary size of 11,327 lemmas.
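As a concrete illustration of this setup, the sketch below reproduces the configuration above with the gensim library; gensim itself and the corpus file name are assumptions, since the paper does not state which implementation was used for training.

# A minimal sketch of the training setup described above, using gensim.
# Assumption: one line of lower-cased, v->u normalized lemmas per sentence
# in a plain-text file extracted from the annotated "Opera Latina" files.
from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("opera_latina_lemmas.txt")  # hypothetical path

for dim in (100, 300):
    # sg=1 is skip-gram (window 10); sg=0 is CBOW (window 5).
    for sg, window in ((1, 10), (0, 5)):
        for cls, name in ((Word2Vec, "word2vec"), (FastText, "fasttext")):
            model = cls(corpus,
                        vector_size=dim,  # 100- or 300-dimensional vectors
                        sg=sg,
                        window=window,
                        negative=25,      # number of negatives sampled
                        workers=20,       # number of threads
                        epochs=15,        # iterations over the corpus
                        min_count=5)      # minimal number of occurrences
            arch = "sg" if sg else "cbow"
            model.wv.save(f"{name}_{arch}_{dim}.kv")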
5 Evaluation

Word embeddings resulting from the experiments described in the previous Section were tested by performing both an intrinsic and a qualitative evaluation (Schnabel et al., 2015). To the best of our knowledge, these methods, although well documented in the literature, have never been applied to the evaluation of Latin embeddings.

5.1 Synonym Selection Task

In the synonym selection task, the goal is to select the correct synonym of a target lemma out of a set of possible answers (Baroni et al., 2014). The most commonly used benchmark for this task is the Test of English as a Foreign Language (TOEFL), consisting of multiple-choice questions each involving five terms: the target word and another four, one of which is a synonym of the target word and the remaining three decoys (Landauer and Dumais, 1997). The original TOEFL dataset is made of only 80 questions, but extensions have been proposed to widen the set of multiple-choice questions using external resources such as WordNet (Ehlert, 2003; Freitag et al., 2005).

In order to create a TOEFL-like benchmark for Latin, we relied on four digitized dictionaries of Latin synonyms (Hill, 1804; Dumesnil, 1819; Von Doederlein and Taylor, 1875; Skřivan, 1890) available online in XML Dictionary eXchange format (https://github.com/nikita-moor/latin-dictionary). Starting from the digital versions of the dictionaries, we proceeded as follows:

• we downloaded and parsed the XML files so as to extract only the information useful for our purposes, that is, the dictionary entry and the synonyms;
• we merged the content of all dictionaries to obtain the largest possible list of lemmas with their corresponding synonyms. Unlike "Opera Latina" and the other synonym dictionaries, Dumesnil (1819) often lemmatizes verbs under the infinitive form; therefore, for the sake of uniformity, we used LEMLAT v3 (https://github.com/CIRCSE/LEMLAT3) to obtain the first person, singular, present, active (or passive, in case of deponent verbs), indicative form of all verbs registered in that dictionary in their present infinitive form (e.g. accingere 'to gird on' → accingo) (Passarotti et al., 2017). At the end of this phase, we obtained a new resource containing 2,759 unique entries, covering all types of PoS, together with their synonyms;
• multiple-choice questions were created by taking each entry as a target lemma, then adding its first synonym and another three lemmas randomly chosen from the "Opera Latina" corpus (see the sketch after Table 1);
• a Latin language expert manually checked samples of the multiple-choice questions so as to be sure that the three randomly chosen lemmas were in fact decoy lemmas.

Table 1 provides some examples of the multiple-choice questions generated using the procedure described above.

TARGET WORD | SYNONYM | DECOY WORDS
decretum/decree | edictum/proclamation | flagitium/shameful act, adolesco/to grow up, stipendiarius/tributary
saepe/often | crebro/frequently | conquiro/to seek for, ululatus/howling, frugifer/fertile
rogo/to ask | oro/to ask for | columna/column, retorqueo/to twist back, errabundus/vagrant
exilis/thin | macer/emaciated | moles/pile, mortalitas/mortality, audens/daring

Table 1: Examples taken from the Latin benchmark for the synonym selection task.
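The question-generation step in the procedure above can be sketched as follows; the input lists are hypothetical miniatures standing in for the parsed dictionary data and the full "Opera Latina" lemma inventory.

# Sketch of the multiple-choice question generation described above.
import random

def make_question(target, synonym, corpus_lemmas, n_decoys=3):
    # Randomly pick decoy lemmas, excluding the target and its synonym;
    # samples of the output are then checked by a Latin language expert.
    pool = [l for l in corpus_lemmas if l not in (target, synonym)]
    return {"target": target, "synonym": synonym,
            "decoys": random.sample(pool, n_decoys)}

# Hypothetical miniatures: pairs parsed from the four dictionaries and a
# few lemmas from "Opera Latina".
synonym_pairs = [("decretum", "edictum"), ("saepe", "crebro")]
corpus_lemmas = ["flagitium", "adolesco", "stipendiarius", "conquiro",
                 "ululatus", "frugifer", "columna", "retorqueo"]

questions = [make_question(t, s, corpus_lemmas) for t, s in synonym_pairs]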
We computed the performance of the embeddings by calculating the cosine similarity between the vector of the target lemma and that of each of the other lemmas, picking the candidate with the largest cosine. Questions containing lemmas not included in the vocabulary, and thus vectorless, are automatically filtered out; results are given in terms of accuracy. As shown in Table 2, fastText proved to be the best lemma representation for the synonym selection task, with the skip-gram architecture achieving an accuracy above 86%. This result can be explained by the fact that fastText is able to model morphology by taking into consideration sub-word units (i.e. character n-grams), joining lemmas from the same derivational families. In addition, the skip-gram architecture works well with small amounts of training data like ours. It is also worth noting that, for both architectures and models, vectors with a modest dimensionality achieved a slightly higher accuracy with respect to embeddings with 300 dimensions.

Dimensions | word2vec cbow | word2vec skip-gram | fastText cbow | fastText skip-gram
100 | 81.14% | 79.83% | 80.57% | 86.91%
300 | 80.86% | 79.48% | 79.43% | 86.40%

Table 2: Results of the synonym selection task calculated on the whole benchmark.

The error analysis revealed specific types of linguistic and semantic relations, other than synonymy, holding between the target lemma and the decoy lemma that turned out to have the largest cosine: for example, meronymy (e.g. target word: annalis 'chronicles' - synonym: historia 'narrative of past events' - answer: charta 'paper') and morphological derivation (e.g. target word: consors 'having a common lot' - synonym: particeps 'sharer' - answer: sors 'lot').
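The selection rule just described (largest cosine with the target, vectorless questions filtered out) amounts to a few lines of code; a minimal sketch, assuming vectors saved in gensim's KeyedVectors format under hypothetical file names:

# Sketch of the synonym selection evaluation described above.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("fasttext_sg_100.kv")  # hypothetical file name

# One benchmark question in the format produced by the generation step.
question = {"target": "saepe", "synonym": "crebro",
            "decoys": ["conquiro", "ululatus", "frugifer"]}

def answer(q, wv):
    lemmas = [q["target"], q["synonym"]] + q["decoys"]
    # Questions containing vectorless lemmas are filtered out.
    if any(l not in wv.key_to_index for l in lemmas):
        return None
    candidates = [q["synonym"]] + q["decoys"]
    # Pick the candidate with the largest cosine similarity to the target.
    return max(candidates, key=lambda c: wv.similarity(q["target"], c))

answers = [(q, answer(q, wv)) for q in [question]]
answers = [(q, a) for q, a in answers if a is not None]
accuracy = sum(a == q["synonym"] for q, a in answers) / len(answers)
print(f"accuracy: {accuracy:.2%}")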
As an additional analysis, we repeated our evaluation on a subset of the benchmark containing 85 questions made of lemmas sharing the same PoS, e.g. auxilior 'to assist', adiuuo 'to help', censeo 'to assess', reuerto 'to turn back', humo 'to bury'. Results reported in Table 3 confirm that the skip-gram architecture provides the best accuracy for this task, achieving a score above 90% for fastText embeddings with 300 dimensions. We also note an improvement of the accuracy for word2vec (+5%). The reasons behind these results need further investigation.

Dimensions | word2vec cbow | word2vec skip-gram | fastText cbow | fastText skip-gram
100 | 81.48% | 85.18% | 77.77% | 87.03%
300 | 76.63% | 85.18% | 75.92% | 90.74%

Table 3: Results of the synonym selection task calculated on a subset of the benchmark containing only questions with lemmas sharing the same PoS.

5.2 Qualitative Evaluation on Rare Lemma Embeddings

One of the main differences between word2vec and fastText is that the latter is supposed to be able to generate better embeddings for words that occur rarely in the training data. This is due to the fact that rare words in word2vec have few neighboring context words from which to learn the vector representation, whereas in fastText even rare words share their character n-grams with other words, making it possible to represent them reliably. To validate this hypothesis, we performed a qualitative evaluation of the nearest neighbors of a small set of randomly selected lemmas appearing only between 5 and 10 times in the "Opera Latina" corpus. Two Latin language experts manually checked the two most similar lemmas (in terms of cosine similarity) induced by the different 100-dimension embeddings we trained. Table 4 presents a sample of the selected rare lemmas and their neighbors: an asterisk marks neighbors that the two experts judged as most semantically related to the target lemma.

MODEL | contrudo/to thrust | frugaliter/thriftily | auspicatus/consecrated by auspices
fastText-skip | protrudo*/to thrust forward, extrudo*/to thrust out | frugalis*/thrifty, frugalitas*/economy | auspicato*/after taking the auspices, auspicium*/auspices
fastText-cbow | contego*/to cover, contraho/to collect | aliter/differently, negligenter/neglectfully | auguratus*/the office of augur, pontificatus/the office of pontifex
word2vec-skip | infodio/to bury, tabeo/to melt away | frugi*/frugal, quaerito/to seek earnestly | erycinus/Erycinian, parilia/the feast of Pales
word2vec-cbow | refundo/to pour back, infodio/to bury | lautus/neat, frugi*/frugal | erycinus/Erycinian, parilia/the feast of Pales

Table 4: Examples of the nearest neighbors of rare lemmas.

This manual inspection, even if based on a small set of data, shows that the embeddings trained using the fastText model with the skip-gram architecture can find more similar lemmas than those trained with other models and architectures.
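Neighbor lists like those in Table 4 can be extracted with a short query loop; a sketch, under the same hypothetical file-name assumptions as above:

# Sketch: inspect the two nearest neighbors of each rare lemma per model.
from gensim.models import KeyedVectors

# Rare lemmas occurring between 5 and 10 times in "Opera Latina" (Table 4).
rare_lemmas = ["contrudo", "frugaliter", "auspicatus"]

for name in ("fasttext_sg_100", "fasttext_cbow_100",
             "word2vec_sg_100", "word2vec_cbow_100"):
    wv = KeyedVectors.load(name + ".kv")  # hypothetical file names
    for lemma in rare_lemmas:
        # Two most cosine-similar lemmas, as inspected by the two experts.
        print(name, lemma, wv.most_similar(lemma, topn=2))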
6 A Diachronic Perspective

Diachronic analyses are particularly relevant for Latin, given that its use spans more than two millennia. To support this type of study we release, together with the embeddings presented in the previous Sections, lemma vectors trained on the "Opera Maiora", written by Thomas Aquinas in the 13th century. "Opera Maiora" is a set of philosophical and religious works comprising some 4.5 million words (Passarotti, 2015): all texts are manually lemmatized and tagged at the morphological level (Passarotti, 2010) and are part of the "Index Thomisticus" (IT) corpus.

Before training the embeddings, we pre-processed the texts following the conventions adopted in "Opera Latina": we lower-cased, removed punctuation, and converted v and j into u and i, respectively. Embeddings were trained with the configuration that reported the best results in the evaluation described in Section 5 (i.e. fastText with the skip-gram architecture and 100 dimensions). For a comparative analysis with the embeddings of "Opera Latina", we aligned the embeddings of "Opera Maiora" to the same coordinate axes using the unsupervised alignment algorithm provided with the fastText code (Grave et al., 2018b). Thanks to this alignment, we can inspect the nearest neighbors (nn) of lemmas in the two embedding spaces. For example, the lemma ordo shifts from social class or military rank (among the top 10 nn in the "Opera Latina" embeddings we find, in this order, equester 'cavalry', legionarius 'legionary', turmatim 'by squadrons') to referring to the concept of order and intellectual structure in Thomas Aquinas (nn in "Opera Maiora": ordinatio 'setting in order', coordinatio 'arranging together', ordino 'set in order') (Busa, 1977). Another interesting case is spiritus: in the Classical era it refers to 'breath' (nn in "Opera Latina": spiro 'to blow', exspiro 'to exhale', spiramentum 'draught'), while in Aquinas' Christian writings it is associated with the Holy Ghost (nn: sanctio 'to make sacred', donum 'gift', paracletus 'protector') (Busa, 1983).
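Once the two spaces share coordinate axes, such comparisons reduce to neighbor queries and cross-period cosine distances; a sketch, assuming the alignment has already been performed with the fastText alignment code and the resulting vectors saved under hypothetical file names:

# Sketch: compare nearest neighbors of the same lemma in the Classical
# ("Opera Latina") and medieval ("Opera Maiora") embedding spaces.
import numpy as np
from gensim.models import KeyedVectors

classical = KeyedVectors.load("opera_latina_ft_sg_100.kv")         # hypothetical
medieval = KeyedVectors.load("opera_maiora_aligned_ft_sg_100.kv")  # hypothetical

for lemma in ("ordo", "spiritus"):
    print(lemma)
    print("  Opera Latina nn:",
          [w for w, _ in classical.most_similar(lemma, topn=10)])
    print("  Opera Maiora nn:",
          [w for w, _ in medieval.most_similar(lemma, topn=10)])
    # Because the axes are shared, the shift of a lemma can also be
    # quantified directly as the cosine distance between its two vectors.
    u, v = classical[lemma], medieval[lemma]
    dist = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(f"  cross-period cosine distance: {dist:.3f}")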
7 Conclusion and Future Work

In this paper we presented a new set of Latin embeddings based on high quality lemmatized corpora, together with a new benchmark for the synonym selection task. The aligned embeddings can be visually explored through a web interface, and all the resources are freely available online: https://embeddings.lila-erc.eu.

Several directions for future work are envisaged. For example, we plan to develop new benchmarks, such as the analogy test (Mikolov et al., 2013b) or the rare words dataset (Luong et al., 2013), for the intrinsic quantitative evaluation of Latin embeddings. Moreover, embeddings could be used to improve the linking of datasets in the LiLa Knowledge Base. We would also like to extend the diachronic analysis to the embeddings trained on the "Computational Historical Semantics" database as soon as these become available.

This work represents the first step towards the development of a new set of resources for the analysis of Latin. This effort is laying the foundations of the first campaign devoted to the evaluation of NLP tools for Latin, EvaLatin.

Acknowledgments

This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme via the "LiLa: Linking Latin" project - Grant Agreement No. 769994. The authors also wish to thank Andrea Peverelli for his expert support on Latin and Chris Culy for providing his code for the embeddings visualization.

References

David Bamman and David Smith. 2012. Extracting two thousand years of Latin from a million book library. Journal on Computing and Cultural Heritage (JOCCH), 5(1):1-13.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238-247.

Monica Berti. 2019. Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution, volume 10. Walter de Gruyter GmbH & Co KG.

Johannes Bjerva and Raf Praet. 2015. Word embeddings pointing the way for late antiquity. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 53-57.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Roberto Busa. 1974-1980. Index Thomisticus: sancti Thomae Aquinatis operum omnium indices et concordantiae, in quibus verborum omnium et singulorum formae et lemmata cum suis frequentiis et contextibus variis modis referuntur quaeque / consociata plurium opera atque electronico IBM automato usus digessit Robertus Busa SJ. Frommann-Holzboog.

Roberto Busa. 1977. Ordo dans les oeuvres de St. Thomas d'Aquin. II Coll. Intern. del Lessico Intellettuale Europeo, pages 59-184.

Roberto Busa. 1983. De voce spiritus in operibus S. Thomae Aquinatis. IV Coll. Intern. del Lessico Intellettuale Europeo, pages 191-222.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537.

Joseph Denooz. 2004. Opera latina: une base de données sur internet. Euphrosyne, 32:79-88.

Jean Baptiste Gardin Dumesnil. 1819. Latin Synonyms: With Their Different Significations: and Examples Taken from the Best Latin Authors. GB Whittaker.

Bret R. Ehlert. 2003. Making accurate lexical semantic similarity judgments using word-context co-occurrence statistics. University of California, San Diego.

Dayne Freitag, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, and Zhiqiang Wang. 2005. New experiments in distributional representations of synonymy. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 25-32. Association for Computational Linguistics.

Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford Univ. Press.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018a. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3843-3847, Miyazaki, Japan. European Language Resources Association (ELRA).

Edouard Grave, Armand Joulin, and Quentin Berthet. 2018b. Unsupervised alignment of embeddings with Wasserstein Procrustes, pages 1880-1890.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489-1501, Berlin, Germany. Association for Computational Linguistics.

John Hill. 1804. The Synonymes in the Latin Language, Alphabetically Arranged; with Critical Dissertations Upon the Force of Its Prepositions, Both in a Simple and Compounded State. James Ballantyne, for Longman and Rees, London.

Bernhard Jussen and Gregor Rohmann. 2015. Historical semantics in medieval studies: New means and approaches. Contributions to the History of Concepts, 10(2):1-6.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260-270.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.

Susan Leavy, Karen Wade, Gerardine Meaney, and Derek Greene. 2018. Navigating literary text with word embeddings and semantic lexicons. In Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018), Lausanne, Switzerland, 4-5 June 2018.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104-113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Marco Passarotti. 2010. Leaving behind the less-resourced status. The case of Latin through the experience of the Index Thomisticus Treebank. In 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, Valletta, Malta, 23 May 2010, pages 27-32.

Marco Passarotti. 2015. What you can do with linguistically annotated data. From the Index Thomisticus to the Index Thomisticus Treebank. In Reading Sacred Scripture with Thomas Aquinas: Hermeneutical Tools, Theological Questions and New Perspectives, pages 3-44.

Marco Passarotti, Marco Budassi, Eleonora Litta, and Paolo Ruffolo. 2017. The Lemlat 3.0 package for morphological analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, number 133, pages 24-31. Linköping University Electronic Press.

Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298-307.

Arnošt Skřivan. 1890. Latinská synonymika pro školu i dům. V Chrudimi.

Rachele Sprugnoli and Sara Tonelli. 2019. Novel event detection and classification for historical texts. Computational Linguistics, 45(2):229-265.

Ludwig Von Doederlein and Samuel Harvey Taylor. 1875. Döderlein's Hand-book of Latin Synonymes. WF Draper.

Liang-Chih Yu, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534-539.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-21.