<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vir is to Moderatus as Mulier is to Intemperans: Lemma Embeddings for Latin</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachele Sprugnoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Passarotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Moretti</string-name>
          <email>giovanni.moretti@unicatt.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Largo Agostino Gemelli 1</institution>
          ,
          <addr-line>20123 Milano</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>25</fpage>
      <lpage>32</lpage>
      <abstract>
        <p>This paper presents a new set of lemma embeddings for the Latin language. Embeddings are trained on a manually annotated corpus of texts belonging to the Classical era: different models, architectures and dimensions are tested and evaluated using a novel benchmark for the synonym selection task. A qualitative evaluation is also performed on the embeddings of rare lemmas. In addition, we release vectors pre-trained on the “Opera Maiora” by Thomas Aquinas, thus providing a resource to analyze Latin in a diachronic perspective.<sup>1</sup></p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Any study of the ancient world is inextricably bound to empirical sources, be they archaeological relics, artifacts or texts. Most ancient texts are written in dead languages, one of the distinguishing features of which is that both their lexicon and their textual evidence are essentially closed, without any substantial new addition. This finite nature of dead languages, together with the need for empirical data in their study, makes the preservation and careful analysis of their legacy a core task of the (scientific) community. Although computational and corpus linguistics have mainly focused on building tools and resources for modern languages, there has always been great interest in providing scholars with collections of texts written in dead or historical languages
        <xref ref-type="bibr" rid="ref3">(Berti, 2019)</xref>
        .
Not by chance, one of the first electronic corpora
ever produced is the “Index Thomisticus”
        <xref ref-type="bibr" rid="ref6">(Busa, 1974-1980)</xref>
        , the opera omnia of Thomas Aquinas written in Latin in the 13th century. Owing to its wide diachronic span covering more than two millennia, as well as its diatopic distribution across Europe and the Mediterranean, Latin is the most resourced historical language with respect to the availability of textual corpora. Large collections of Latin texts, e.g. the Perseus Digital Library<sup>2</sup> and the corpus of Medieval Italian Latinity ALIM<sup>3</sup>, can now be processed with state-of-the-art computational tools and methods to provide linguistic resources that enable scholars to exploit the empirical evidence provided by such datasets to the fullest. This is particularly promising given that the quality of many textual resources for Latin, carefully built over decades, is high.
      </p>
      <p><sup>1</sup> Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Recent years have seen the rise of language modeling and feature learning techniques applied to linguistic data, resulting in so-called “word embeddings”, i.e. empirically trained vectors of lexical items in which words occurring in similar linguistic contexts are assigned nearby positions in the vector space. The semantic meaningfulness and motivation of word embeddings stem from the basic assumption of distributional semantics, according to which the distributional properties of words mirror their semantic similarities and/or differences, so that words sharing similar contexts tend to have similar meanings.</p>
      <p>In this paper, we present and evaluate a number of embeddings for Latin built from a manually lemmatized dataset containing texts from the Classical era.<sup>4</sup> In addition, we release embeddings trained on a manually lemmatized corpus of medieval texts to facilitate diachronic analyses. This research is performed in the context of the LiLa: Linking Latin project, which seeks to build a Knowledge Base of linguistic resources for Latin connected via a common vocabulary of knowledge description following the principles of the Linked Data framework.<sup>5</sup> Our contribution provides the community with new resources to be connected in the LiLa Knowledge Base, aimed at supporting data-driven socio-cultural studies of the Latin world. The added value of our lemma embeddings for Latin results from the interdisciplinary blending of state-of-the-art methods in computational linguistics with the long tradition of Latin corpora creation: on the one hand, the embeddings are evaluated with techniques hitherto applied only to modern-language data; on the other, they are built from high-quality datasets heavily used by scholars working on Latin.</p>
      <p><sup>2</sup> http://www.perseus.tufts.edu/hopper/</p>
      <p><sup>3</sup> http://www.alim.dfll.univr.it/</p>
      <p><sup>4</sup> Word embeddings built on tokens of the same dataset are also available online.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Word embeddings are crucial to many
Natural Language Processing (NLP) tasks
        <xref ref-type="bibr" rid="ref29 ref9">(Collobert
et al., 2011; Lample et al., 2016; Yu et al.,
2017)</xref>
        . Numerous pre-trained word vectors built with different algorithms have been released, typically trained on huge amounts of contemporary texts written in modern languages. Interest in this type of distributional approach has also emerged in the Digital Humanities, as evidenced by publications on the use of word embeddings trained on literary texts or historical documents
        <xref ref-type="bibr" rid="ref13 ref17 ref19">(Hamilton et al., 2016; Leavy et al., 2018;
Sprugnoli and Tonelli, 2019)</xref>
        . To a lesser extent, the literature also reports work on word embeddings for dead languages, including Latin.
      </p>
      <p>
        Both Facebook and the organizers of the
CoNLL shared tasks on multilingual parsing
have pre-computed and released word
embeddings trained on Latin texts crawled from the web:
the former using the fastText model on Common
Crawl and Wikipedia dumps
        <xref ref-type="bibr" rid="ref11 ref12">(Grave et al., 2018a)</xref>
        ,
the latter applying word2vec to Common Crawl
only
        <xref ref-type="bibr" rid="ref30">(Zeman et al., 2018)</xref>
        . Both resources were developed by relying on automatic language detection engines: they are very large in terms of vocabulary size<sup>6</sup> but highly noisy due to the presence of languages other than Latin. In addition, they include terms related to modern times, such as movie stars, TV series and companies (e.g., Cumberbatch, Simpson, Google), making them unsuitable for the study of language use in ancient texts. Automatic language detection has also been employed by
        <xref ref-type="bibr" rid="ref1">Bamman and Smith (2012)</xref>
        to collect a corpus of Latin books available from the Internet Archive. The corpus spans from 200 BCE to the 20th century and contains 1.38 billion tokens: embeddings trained on this corpus<sup>7</sup> were used to investigate the relationship between concepts and historical characters in the work of Cassiodorus
        <xref ref-type="bibr" rid="ref15 ref26 ref4">(Bjerva and Praet, 2015)</xref>
        . However, these word vectors are affected by OCR errors present in the training corpus: 25% of the entries in the embedding vocabulary contain non-alphanumeric characters, e.g. -**-, iftudˆ. The quality of the corpus used to train the Latin word embeddings available through the SemioGraph interface<sup>8</sup>, on the other hand, is high: these embeddings are based on the “Computational Historical Semantics” database, a manually curated collection of 4,000 Latin texts written between the 2nd and the 15th century AD
        <xref ref-type="bibr" rid="ref15 ref26 ref4">(Jussen and Rohmann, 2015)</xref>
        . In SemioGraph, more than one hundred word vectors can be visually explored by searching by Part-of-Speech (PoS) labels and text genres; however, these vectors cannot be downloaded for further analysis and were generated with one model only, i.e. word2vec.
      </p>
      <p><sup>5</sup> https://lila-erc.eu/</p>
      <p><sup>6</sup> For example, the size of the CoNLL embeddings vocabulary is 1,082,365 words.</p>
      <p>Unlike the works cited above, in this paper we rely on manually lemmatized texts free of OCR errors, we focus on a period not covered by the “Computational Historical Semantics” database, and we test two models to learn lemma representations. It is worth noting that none of the previously mentioned studies has carried out an evaluation of the trained Latin embeddings; we, by contrast, provide both quantitative and qualitative evaluations of our vectors.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Dataset Description</title>
      <p>
        Our lemma vectors were trained on the “Opera
Latina” corpus
        <xref ref-type="bibr" rid="ref10">(Denooz, 2004)</xref>
        . This textual
resource has been collected and manually annotated
since 1961 by the Laboratoire d’Analyse
Statistique des Langues Anciennes (LASLA) at the
University of Liège<sup>9</sup>. It includes 158 texts from 20 different Classical authors covering various genres, such as treatises (e.g. “Annales” by Tacitus), letters (e.g. “Epistulae” by Pliny the Younger), epic poems (e.g. “Aeneis” by Virgil), elegies (e.g. “Elegiae” by Propertius), plays (both comedies and tragedies, e.g. “Aulularia” by Plautus and “Oedipus” by Seneca), and public speeches (e.g. “Philippicae” by Cicero)<sup>10</sup>.
      </p>
      <p><sup>7</sup> http://www.cs.cmu.edu/~dbamman/latin.html</p>
      <p><sup>8</sup> http://semiograph.texttechnologylab.org/</p>
      <p><sup>9</sup> http://web.philo.ulg.ac.be/lasla/textes-latins-traites/</p>
      <p>The corpus contains several layers of
linguistic annotation, such as lemmatization, PoS tagging
and tagging of inflectional features, organized in
space-separated files. “Opera Latina” contains
approximately 1,700,000 words (punctuation is not
present in the corpus), corresponding to 133,886
unique tokens and 24,339 unique lemmas.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Experimental Setup</title>
      <p>
        We tested two different vector representations,
namely word2vec
        <xref ref-type="bibr" rid="ref21 ref22">(Mikolov et al., 2013a)</xref>
        and
fastText
        <xref ref-type="bibr" rid="ref5">(Bojanowski et al., 2017)</xref>
        : the former is based on linear bag-of-words contexts and generates a distinct vector for each word, whereas the latter is based on a bag of character n-grams, that is, the vector for a word (or a lemma) is the sum of its character n-gram vectors. Lemma vectors were pre-computed using two dimensionalities (100, 300) and two architectures: skip-gram and Continuous Bag-of-Words (CBOW). In this way, we could evaluate both modest and high-dimensional vectors and both architectures: skip-gram is designed to predict the context given a target word, whereas CBOW predicts the target word based on the context. The window size was 10 lemmas for skip-gram and 5 for CBOW. The other training options were the same for the two architectures:
      </p>
      <list list-type="bullet">
        <list-item><p>number of negatives sampled: 25;</p></list-item>
        <list-item><p>number of threads: 20;</p></list-item>
        <list-item><p>number of iterations over the corpus: 15;</p></list-item>
        <list-item><p>minimal number of word occurrences: 5.</p></list-item>
      </list>
      <p>Embeddings were trained on the lemmatized “Opera Latina” in order to reduce the data sparsity due to the highly inflectional nature of Latin. Moreover, we lower-cased the text and converted v into u (so that vir ‘man’ becomes uir) to fit the lexicographic conventions of some Latin dictionaries (Glare, 1982) and corpora. With the minimal number of lemma occurrences set to 5, we obtained a vocabulary size of 11,327 lemmas.</p>
      <p><sup>10</sup> The corpus can be queried through an online interface after requesting credentials: http://cipl93.philo.ulg.ac.be/OperaLatina/</p>
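      <p>The preprocessing and frequency threshold described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts actually used: the function names and the keys of the options dictionary are hypothetical, while the values are those reported in the text.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Training options reported in the text; the key names are illustrative only.
TRAINING_OPTIONS = {
    "dimensions": (100, 300),
    "window": {"skip-gram": 10, "cbow": 5},
    "negative_samples": 25,
    "threads": 20,
    "iterations": 15,
    "min_count": 5,
}


def normalize_lemma(lemma):
    """Lower-case a lemma and convert v into u, e.g. 'Vir' -> 'uir'."""
    return lemma.lower().replace("v", "u")


def build_vocabulary(lemmas, min_count=TRAINING_OPTIONS["min_count"]):
    """Return the normalized lemmas occurring at least min_count times."""
    counts = Counter(normalize_lemma(lemma) for lemma in lemmas)
    return {lemma for lemma, freq in counts.items() if freq >= min_count}


# Toy input: 'uir' reaches the threshold after normalization, 'roma' does not.
sample = ["Vir"] * 4 + ["uir", "Roma", "roma"]
print(build_vocabulary(sample))  # prints {'uir'}
```
]]></preformat>
      <p>Normalizing before counting matters here: spelling variants such as Vir and uir are merged first, so a lemma is kept or dropped on the basis of its combined frequency.</p>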
    </sec>
    <sec id="sec-5">
      <title>5 Evaluation</title>
      <p>
        Word embeddings resulting from the experiments described in the previous Section were tested by performing both an intrinsic and a qualitative evaluation
        <xref ref-type="bibr" rid="ref26">(Schnabel et al., 2015)</xref>
        . To the best of our knowledge, these methods, although well documented in the literature, have never been applied to the evaluation of Latin embeddings.
      </p>
      <sec id="sec-5-1">
        <title>5.1 Synonym Selection Task</title>
        <p>
          In the synonym selection task, the goal is to
select the correct synonym of a target lemma out
of a set of possible answers
          <xref ref-type="bibr" rid="ref2">(Baroni et al., 2014)</xref>
          .
        </p>
        <p>
          The most commonly used benchmark for this task is the Test of English as a Foreign Language (TOEFL), consisting of multiple-choice questions each involving five terms: the target word and four others, one of which is a synonym of the target word while the remaining three are decoys
          <xref ref-type="bibr" rid="ref18">(Landauer
and Dumais, 1997)</xref>
          . The original TOEFL dataset consists of only 80 questions, but extensions have been proposed to widen the set of multiple-choice questions using external resources such as WordNet (Ehlert, 2003; Freitag et al., 2005).
        </p>
        <p>
          In order to create a TOEFL-like benchmark for Latin, we relied on four digitized dictionaries of Latin synonyms
          <xref ref-type="bibr" rid="ref14 ref28">(Hill, 1804; Dumesnil, 1819; Von Doederlein and Taylor, 1875; Skřivan, 1890)</xref>
          available online in XML Dictionary eXchange format<sup>11</sup>. Starting from the digital versions of the dictionaries, we proceeded as follows:
        </p>
        <list list-type="bullet">
          <list-item><p>we downloaded and parsed the XML files so as to extract only the information useful for our purposes, that is, the dictionary entry and the synonyms;</p></list-item>
          <list-item><p>we merged the content of all dictionaries to obtain the largest possible list of lemmas with their corresponding synonyms. Unlike “Opera Latina” and the other synonym dictionaries, Dumesnil (1819) often lemmatizes verbs under the infinitive form; therefore, for the sake of uniformity, we used LEMLAT v3<sup>12</sup>
          <xref ref-type="bibr" rid="ref23">(Passarotti et al., 2017)</xref>
          to obtain the first person, singular, present, active (or passive, in case of deponent verbs), indicative form of all verbs registered in that dictionary in their present infinitive form (e.g. accingere ‘to gird on’ → accingo). At the end of this phase, we obtained a new resource containing 2,759 unique entries and covering all types of PoS, together with their synonyms;</p></list-item>
          <list-item><p>multiple-choice questions were created by taking each entry as a target lemma, then adding its first synonym and another three lemmas randomly chosen from the “Opera Latina” corpus;</p></list-item>
          <list-item><p>a Latin language expert manually checked samples of multiple-choice questions so as to be sure that the three randomly chosen lemmas were in fact decoy lemmas.</p></list-item>
        </list>
        <table-wrap>
          <label>Table 4</label>
          <caption><p>A sample of rare lemmas and their two nearest neighbors in the 100-dimension embeddings. An asterisk marks neighbors judged by the experts as most semantically related to the target lemma.</p></caption>
          <table>
            <thead>
              <tr><th>Embeddings</th><th>contrudo/to thrust</th><th>frugaliter/thriftily</th><th>auspicatus/consecrated by auspices</th></tr>
            </thead>
            <tbody>
              <tr><td>fastText-skip</td><td>protrudo*/to thrust forward</td><td>frugalis*/thrifty</td><td>auspicato*/after taking the auspices</td></tr>
              <tr><td>fastText-skip</td><td>extrudo*/to thrust out</td><td>frugalitas*/economy</td><td>auspicium*/auspices</td></tr>
              <tr><td>fastText-cbow</td><td>contego*/to cover</td><td>aliter/differently</td><td>auguratus*/the office of augur</td></tr>
              <tr><td>fastText-cbow</td><td>contraho/to collect</td><td>negligenter/neglectfully</td><td>pontificatus/the office of pontifex</td></tr>
              <tr><td>word2vec-skip</td><td>infodio/to bury</td><td>frugi*/frugal</td><td>erycinus/Erycinian</td></tr>
              <tr><td>word2vec-skip</td><td>tabeo/to melt away</td><td>quaerito/to seek earnestly</td><td>parilia/the feast of Pales</td></tr>
              <tr><td>word2vec-cbow</td><td>refundo/to pour back</td><td>lautus/neat</td><td>erycinus/Erycinian</td></tr>
              <tr><td>word2vec-cbow</td><td>infodio/to bury</td><td>frugi*/frugal</td><td>parilia/the feast of Pales</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 1 provides some examples of the multiple-choice questions generated using the procedure described above.</p>
        <p><sup>11</sup> https://github.com/nikita-moor/latin-dictionary</p>
        <p><sup>12</sup> https://github.com/CIRCSE/LEMLAT3</p>
        <p>We computed the performance of the
embeddings by calculating the cosine similarity between
the vector of the target lemma and that of the other
lemmas, picking the candidate with the largest
cosine. Questions containing lemmas not included
in the vocabulary, and thus vectorless, are
automatically filtered out; results are given in terms of
accuracy. As shown in Table 2, fastText proved
to be the best lemma representation for the
synonym selection task with the skip-gram
architecture achieving an accuracy above 86%. This
result can be explained by the fact that fastText is
able to model morphology by taking into
consideration sub-word units (i.e. character n-grams) and
joining lemmas from the same derivational
families. In addition, the skip-gram architecture works
well with small amounts of training data like ours.</p>
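        <p>As a minimal sketch of this scoring procedure, the Python snippet below picks, for each question, the option with the largest cosine to the target, skips questions containing lemmas without a vector, and reports accuracy over the remaining ones. The question format and function names are hypothetical, and the toy vectors are invented purely to make the example runnable.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonym_selection_accuracy(questions, vectors):
    """Return (accuracy, number of answered questions)."""
    correct = answered = 0
    for q in questions:
        # Filter out questions containing out-of-vocabulary lemmas.
        if any(lemma not in vectors for lemma in [q["target"], *q["options"]]):
            continue
        answered += 1
        target_vec = vectors[q["target"]]
        best = max(q["options"], key=lambda o: cosine(target_vec, vectors[o]))
        correct += best == q["answer"]
    return (correct / answered if answered else 0.0), answered


# Invented two-dimensional vectors, for illustration only.
toy_vectors = {
    "auxilior": [1.0, 0.1],
    "adiuuo": [0.9, 0.2],
    "censeo": [0.0, 1.0],
    "humo": [-1.0, 0.0],
    "reuerto": [0.1, -1.0],
}
toy_questions = [
    {"target": "auxilior", "answer": "adiuuo",
     "options": ["censeo", "adiuuo", "humo", "reuerto"]},
    # 'consors' has no vector in the toy vocabulary, so this one is skipped.
    {"target": "consors", "answer": "particeps",
     "options": ["particeps", "censeo", "humo", "reuerto"]},
]
print(synonym_selection_accuracy(toy_questions, toy_vectors))  # prints (1.0, 1)
```
]]></preformat>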
        <p>It is also worth noting that, for both architectures and models, vectors with a modest dimensionality achieved a slightly higher accuracy than embeddings with 300 dimensions.</p>
        <p>The error analysis revealed specific types of linguistic and semantic relations, other than synonymy, holding between the target lemma and the decoy lemma that turned out to have the largest cosine: for example, meronymy (e.g., target word:
annalis ‘chronicles’ - synonym: historia
‘narrative of past events’ - answer: charta ‘paper’) and
morphological derivation (e.g. target word:
consors ‘having a common lot’ - synonym: particeps
‘sharer’ - answer: sors ‘lot’).</p>
        <p>As an additional analysis, we repeated our
evaluation on a subset of the benchmark containing 85
questions made of lemmas sharing the same PoS,
e.g. auxilior ‘to assist’, adiuuo ‘to help’, censeo
‘to assess’, reuerto ‘to turn back’, humo ‘to bury’.</p>
        <p>Results reported in Table 3 confirm that the skip-gram architecture provides the best accuracy for this task, achieving a score above 90% for fastText embeddings with 300 dimensions. We also note an improvement in accuracy for word2vec (+5%).</p>
        <p>The reasons behind these results need further investigation.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Qualitative Evaluation on Rare Lemma Embeddings</title>
        <p>One of the main differences between word2vec
and fastText is that the latter is supposed to be able
to generate better embeddings for words that
occur rarely in the training data. This is due to the
fact that rare words in word2vec have few
neighbor context words from which to learn the
vector representation, whereas in fastText even rare
words share their character n-grams with other
words, making it possible to represent them
reliably. To validate this hypothesis, we performed a
qualitative evaluation of the nearest neighbors of
a small set of randomly selected lemmas
appearing between 5 and 10 times only in the “Opera
Latina” corpus. Two Latin language experts
manually checked the two most similar lemmas (in
terms of cosine similarity) induced by the different
100-dimension embeddings we trained. Table 4
presents a sample of the selected rare lemmas and
their neighbors: an asterisk marks neighbors that the two experts judged as most semantically related to the target lemma. This manual inspection, even if based on a small set of data, shows that the embeddings trained using the fastText model with the skip-gram architecture can find more similar lemmas than those trained with other models and architectures.</p>
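        <p>To make the sub-word argument concrete, the sketch below extracts character n-grams in the way fastText is generally described as doing, with boundary markers around the word. The n-gram range of 3 to 6 is the fastText default and is an assumption here, since the range used in our experiments is not stated in this section; the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in < and >."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


# A rare lemma shares many n-grams with a lemma of the same derivational family.
shared = char_ngrams("contrudo") & char_ngrams("protrudo")
print(sorted(shared))
```
]]></preformat>
        <p>Because the vector of a lemma is the sum of its n-gram vectors, the n-grams shared by contrudo and protrudo (for instance those covering the stem trudo) pull the two lemmas together even when one of them occurs only a handful of times.</p>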
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 A Diachronic Perspective</title>
      <p>
        Diachronic analyses are particularly relevant for
Latin given that its use spans more than two
millennia. To support this type of study we release,
together with the embeddings presented in the
previous Sections, lemma vectors trained on the
“Opera Maiora”, written by Thomas Aquinas in
the 13th century. “Opera Maiora” is a set of
philosophical and religious works comprising some 4.5
million words
        <xref ref-type="bibr" rid="ref25">(Passarotti, 2015)</xref>
        : all texts are
manually lemmatized and tagged at the morphological
level
        <xref ref-type="bibr" rid="ref24">(Passarotti, 2010)</xref>
        and are part of the “Index
Thomisticus” (IT) corpus.
      </p>
      <p>
        Before training the embeddings, we
preprocessed the texts following the conventions
adopted in “Opera Latina”: we lower-cased,
removed punctuation, and converted v and j into u
and i, respectively. Embeddings were trained with
the configuration that achieved the best results in
the evaluation described in Section 5 (i.e. fastText
with the skip-gram architecture and 100
dimensions). For a comparative analysis with the
embeddings of “Opera Latina”, we aligned the
embeddings of “Opera Maiora” to the same
coordinate axes using the unsupervised alignment
algorithm provided with the fastText code
        <xref ref-type="bibr" rid="ref11 ref12">(Grave et
al., 2018b)</xref>
        . Thanks to this alignment, we can
inspect the nearest neighbors (nn) of lemmas in the
two embeddings. For example, the lemma ordo
shifts from social class or military rank (among
the top 10 nn in the “Opera Latina” embeddings
we find, in this order, equester ‘cavalry’,
legionarius ‘legionary’, turmatim ‘by squadrons’) to
referring to the concept of order and intellectual
structure in Thomas Aquinas (nn in “Opera Maiora”:
ordinatio ‘setting in order’, coordinatio
‘arranging together’, ordino ‘set in order’)
        <xref ref-type="bibr" rid="ref7">(Busa, 1977)</xref>
        .
Another interesting case is spiritus: in the
Classical era it refers to ‘breath’ (nn in “Opera Latina”:
spiro ‘to blow’, exspiro ‘to exhale’, spiramentum
‘draught’), while in Aquinas’ Christian writings it is
associated with the Holy Ghost (nn: sanctio ‘to
make sacred’, donum ‘gift’, paracletus
‘protector’)
        <xref ref-type="bibr" rid="ref8">(Busa, 1983)</xref>
        .
      </p>
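      <p>The kind of inspection described above can be sketched as follows: the nearest neighbors of the same lemma are listed separately in each embedding space and then compared. The function names are hypothetical and the two-dimensional vectors are invented toy values, chosen only so that the example runs; they are not taken from the released embeddings.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(lemma, vectors, k=3):
    """Return the k lemmas with the largest cosine to the given lemma."""
    target = vectors[lemma]
    scored = sorted(
        ((cosine(target, vec), other)
         for other, vec in vectors.items() if other != lemma),
        reverse=True,
    )
    return [other for _, other in scored[:k]]


# Invented toy vectors standing in for the two embedding spaces.
classical = {"ordo": [1.0, 0.0], "equester": [0.9, 0.1], "ordinatio": [0.1, 0.9]}
medieval = {"ordo": [0.0, 1.0], "equester": [0.9, 0.1], "ordinatio": [0.1, 0.9]}

for name, space in (("Opera Latina", classical), ("Opera Maiora", medieval)):
    print(name, nearest_neighbors("ordo", space, k=1))
```
]]></preformat>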
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion and Future Work</title>
      <p>In this paper we presented a new set of Latin
embeddings based on high quality lemmatized
corpora and a new benchmark for the synonym
selection task. The aligned embeddings can be
visually explored through a web interface and all
the resources are freely available online: https://embeddings.lila-erc.eu.</p>
      <p>
        Several directions for future work are envisaged. For example, we plan to develop new benchmarks, such as the analogy test
        <xref ref-type="bibr" rid="ref21 ref22">(Mikolov et al., 2013b)</xref>
        or the
rare words dataset
        <xref ref-type="bibr" rid="ref20">(Luong et al., 2013)</xref>
        , for the
intrinsic quantitative evaluation of Latin
embeddings. Moreover, embeddings could be used to
improve the linking of datasets in the LiLa
Knowledge Base. We would also like to extend the
diachronic analysis to the embeddings trained on the
“Computational Historical Semantics” database as
soon as these become available.
      </p>
      <p>This work represents the first step towards the
development of a new set of resources for the
analysis of Latin. This effort is laying the foundations
of the first campaign devoted to the evaluation of
NLP tools for Latin, EvaLatin.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is supported by the European Research
Council (ERC) under the European Union’s
Horizon 2020 research and innovation programme via
the “LiLa: Linking Latin” project - Grant
Agreement No. 769994. The authors also wish to thank
Andrea Peverelli for his expert support on Latin
and Chris Culy for providing his code for the
embeddings visualization.</p>
      <p>Bret R. Ehlert. 2003. Making accurate lexical semantic similarity judgments using word-context co-occurrence statistics. University of California, San Diego.</p>
      <p>Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Bamman</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Extracting two thousand years of Latin from a million book library</article-title>
          .
          <source>Journal on Computing and Cultural Heritage (JOCCH)</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Georgiana</given-names>
            <surname>Dinu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Germán</given-names>
            <surname>Kruszewski</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>238</fpage>
          -
          <lpage>247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Monica</given-names>
            <surname>Berti</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution</article-title>
          , volume
          <volume>10</volume>
          .
          <publisher-name>Walter de Gruyter GmbH &amp; Co KG</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Bjerva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raf</given-names>
            <surname>Praet</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Word embeddings pointing the way for late antiquity</article-title>
          .
          <source>In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)</source>
          , pages
          <fpage>53</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Busa</surname>
          </string-name>
          .
          <year>1974-1980</year>
          .
          <article-title>Index Thomisticus: sancti Thomae Aquinatis operum omnium indices et concordantiae, in quibus verborum omnium et singulorum formae et lemmata cum suis frequentiis et contextibus variis modis referuntur quaeque / consociata plurium opera atque electronico IBM automato usus digessit Robertus Busa SJ</article-title>
          . Frommann - Holzboog.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Busa</surname>
          </string-name>
          .
          <year>1977</year>
          .
          <article-title>Ordo dans les oeuvres de St. Thomas d'Aquin</article-title>
          .
          <source>II Coll. Intern. del Lessico Intellettuale Europeo</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Busa</surname>
          </string-name>
          .
          <year>1983</year>
          .
          <article-title>De voce spiritus in operibus S. Thomae Aquinatis</article-title>
          .
          <source>IV Coll. Intern. del Lessico Intellettuale Europeo</source>
          , pages
          <fpage>191</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (Aug):
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Denooz</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Opera Latina: une base de données sur internet</article-title>
          .
          <source>Euphrosyne</source>
          ,
          <volume>32</volume>
          :
          <fpage>79</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov.
          <year>2018a</year>
          .
          <article-title>Learning Word Vectors for 157 Languages</article-title>
          . In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors,
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          , pages
          <fpage>3843</fpage>
          -
          <lpage>3847</lpage>
          , Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Armand Joulin, and
          <string-name>
            <given-names>Quentin</given-names>
            <surname>Berthet</surname>
          </string-name>
          . 2018b.
          <article-title>Unsupervised Alignment of Embeddings with Wasserstein Procrustes</article-title>
          . pages
          <fpage>1880</fpage>
          -
          <lpage>1890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>William L.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          , Jure Leskovec, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Diachronic word embeddings reveal statistical laws of semantic change</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>1489</fpage>
          -
          <lpage>1501</lpage>
          , Berlin, Germany, August. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>John</given-names>
            <surname>Hill</surname>
          </string-name>
          .
          <year>1804</year>
          .
          <article-title>The Synonymes in the Latin Language, Alphabetically Arranged; with Critical Dissertations Upon the Force of Its Prepositions, Both in a Simple and Compounded State: By John Hill, LL. D. Professor of Humanity in the University, and Fellow of the Royal Society, of Edinburgh</article-title>
          . James Ballantyne, for Longman and Rees, London.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Jussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gregor</given-names>
            <surname>Rohmann</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Historical Semantics in Medieval Studies: New Means and Approaches</article-title>
          .
          <source>Contributions to the History of Concepts</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
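          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lample</surname>
          </string-name>
          , Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Dyer</surname>
          </string-name>
          .
          <year>2016</year>
          .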
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          , pages
          <fpage>260</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Rachele</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sara</given-names>
            <surname>Tonelli</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Novel event detection and classification for historical texts</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>45</volume>
          (
          <issue>2</issue>
          ):
          <fpage>229</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>
          .
          <source>Psychological Review</source>
          ,
          <volume>104</volume>
          (
          <issue>2</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Susan</given-names>
            <surname>Leavy</surname>
          </string-name>
          , Karen Wade, Gerardine Meaney, and
          <string-name>
            <given-names>Derek</given-names>
            <surname>Greene</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Navigating literary text with word embeddings and semantic lexicons</article-title>
          .
          <source>In Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)</source>
          , Lausanne, Switzerland, 4-5 June 2018.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Better word representations with recursive neural networks for morphology</article-title>
          .
          <source>In Proceedings of the Seventeenth Conference on Computational Natural Language Learning</source>
          , pages
          <fpage>104</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013a</year>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>Proceedings of Workshop at ICLR</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013b</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Passarotti</surname>
          </string-name>
          , Marco Budassi, Eleonora Litta, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Ruffolo</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Lemlat 3.0 Package for Morphological Analysis of Latin</article-title>
          .
          <source>In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, number 133</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>31</lpage>
          . Linköping University Electronic Press.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Passarotti</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Leaving behind the less-resourced status. The case of Latin through the experience of the Index Thomisticus Treebank</article-title>
          . In
          <source>7th SaLTMiL Workshop on Creation and use of basic lexical resources for less-resourced languages, LREC 2010</source>
          , Valletta, Malta, 23 May 2010, Workshop programme, pages
          <fpage>27</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Passarotti</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>What you can do with linguistically annotated data. From the Index Thomisticus to the Index Thomisticus Treebank</article-title>
          . In
          <source>Reading Sacred Scripture with Thomas Aquinas: Hermeneutical Tools, Theological Questions and New Perspectives</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Schnabel</surname>
          </string-name>
          , Igor Labutov, David Mimno, and
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Evaluation methods for unsupervised word embeddings</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>298</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Arnošt</given-names>
            <surname>Skřivan</surname>
          </string-name>
          .
          <year>1890</year>
          .
          <article-title>Latinská synonymika pro školu i dům</article-title>
          . V Chrudimi.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
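          <string-name>
            <given-names>Ludwig</given-names>
            <surname>Von Doederlein</surname>
          </string-name>
          and
          <string-name>
            <given-names>Samuel Harvey</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>1875</year>
          .
          <article-title>Döderlein's Hand-book of Latin Synonymes</article-title>
          . WF Draper.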
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Liang-Chih</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jin</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K Robert</given-names>
            <surname>Lai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xuejie</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Refining word embeddings for sentiment analysis</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>534</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Zeman</surname>
          </string-name>
          , Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and
          <string-name>
            <given-names>Slav</given-names>
            <surname>Petrov</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies</article-title>
          .
          <source>In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>