=Paper= {{Paper |id=Vol-2402/paper2 |storemode=property |title=The LiLa Knowledge Base of Linguistic Resources and NLP Tools for Latin |pdfUrl=https://ceur-ws.org/Vol-2402/paper2.pdf |volume=Vol-2402 |authors=Marco C. Passarotti,Flavio M. Cecchini,Greta Franzini,Eleonora Litta,Francesco Mambrini,Paolo Ruffolo |dblpUrl=https://dblp.org/rec/conf/ldk/PassarottiCFLMR19 }} ==The LiLa Knowledge Base of Linguistic Resources and NLP Tools for Latin== https://ceur-ws.org/Vol-2402/paper2.pdf
The LiLa Knowledge Base of Linguistic Resources
and NLP Tools for Latin
Marco C. Passarotti1
CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy
marco.passarotti@unicatt.it
Flavio M. Cecchini
CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy
flavio.cecchini@unicatt.it
Greta Franzini
CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy
greta.franzini@unicatt.it
Eleonora Litta
CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy
eleonoramaria.litta@unicatt.it
Francesco Mambrini
CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy
francesco.mambrini@unicatt.it
Paolo Ruffolo
CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy
paolo.ruffolo@posteo.net

       Abstract
The LiLa: Linking Latin project was recently awarded funding from the European Research Council
to build a Knowledge Base of linguistic resources for Latin. LiLa responds to the growing need in the
fields of Computational Linguistics, Humanities Computing and Classics to create an interoperable
ecosystem of resources and Natural Language Processing tools for Latin. To this end, LiLa makes
use of Linked Open Data practices and standards to connect words to distributed textual and lexical
resources via unique identifiers. In so doing, it builds rich knowledge graphs, which can be used for
research and teaching purposes alike. This paper details the architecture of the LiLa Knowledge
Base and presents the solutions found to address the challenges raised by populating it with a first
set of linguistic resources.

2012 ACM Subject Classification Information systems → Ontologies; Information systems →
Graph-based database models; Information systems → Semantic web description languages; Applied
computing → Digital libraries and archives; Applied computing → Annotation

Keywords and phrases Latin, Linguistics, Linked Open Data, NLP, Metadata, Graph

Funding Marco C. Passarotti: [ERC-2017-COG]
Flavio M. Cecchini: [ERC-2017-COG]
Greta Franzini: [ERC-2017-COG]
Eleonora Litta: [ERC-2017-COG]
Francesco Mambrini: [ERC-2017-COG]
Paolo Ruffolo: [ERC-2017-COG]

Acknowledgements The LiLa project has received funding from the European Research Council
(ERC) under the European Union’s Horizon 2020 research and innovation programme – Grant
Agreement No. 769994.


1
    Corresponding author.
            © Marco C. Passarotti, Flavio M. Cecchini, Greta Franzini, Eleonora Litta, Francesco Mambrini,
            Paolo Ruffolo;
            licensed under Creative Commons License CC-BY
LDK 2019 - Posters Track.
Editors: Thierry Declerck and John P. McCrae
XX:2   LiLa: Linking Latin


           1   Introduction

       Despite the proliferation and the increasing coverage of linguistic resources for many languages,
       the interoperability issues imposed by their different formats severely limit their potential
       for exploitation and use. Indeed, linking linguistic resources to one another would maximize
       their contribution to, and use in, linguistic analysis at multiple levels, be those lexical,
       morphological, syntactic, semantic or pragmatic.
           The objective of the LiLa: Linking Latin project (2018-2023)2 is to connect and, ultimately,
       exploit the wealth of linguistic resources and Natural Language Processing (NLP) tools for
       Latin developed thus far, in order to bridge the gap between raw language data, NLP and
       knowledge description [4, p. 111]. Latin is an optimal use case for this kind of research for
       two reasons: (a) the diachrony and diversity of the language present complex challenges for
       NLP; (b) an interconnected network of the numerous linguistic resources currently available
       for Latin would greatly support both research and learning communities, including historians,
       philologists, archaeologists and literary scholars.
          LiLa addresses this challenge by building a Linked Data Knowledge Base of linguistic
       resources (e.g., corpora, lexica, ontologies, dictionaries, thesauri) and NLP tools (e.g.,
       tokenizers, lemmatizers, PoS-taggers, morphological analyzers and dependency parsers) for
       Latin currently available from different providers under various licences. This paper details
       the architecture of the LiLa Knowledge Base and presents the solutions found to address the
       challenges raised by populating it with a first set of linguistic resources.



           2   The LiLa Knowledge Base

       In order to achieve interoperability between resources and tools, LiLa makes use of a set
       of Semantic Web and Linguistic Linked Open Data standards. These include ontologies to
       describe linguistic annotation (OLiA [3]), corpus annotation (NIF [6], CoNLL2RDF [2]) and
       lexical resources (Lemon [1], Ontolex3). The Resource Description Framework (RDF) [7]
       is used to encode graph-based data structures to represent linguistic annotations in terms
       of triples. The SPARQL language is used to query the data recorded in the form of RDF
       triples [12].
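
As a minimal illustration of the triple model, the sketch below stores a few subject-predicate-object statements as plain Python tuples and retrieves them with a wildcard pattern, which is the basic operation that a SPARQL triple pattern performs. The identifiers (`lemma:diameter`, `token:42`, `lila:hasLemma`) and the `match` helper are hypothetical and are not actual LiLa URIs or code.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy triples; all identifiers are illustrative, not actual LiLa URIs.
TRIPLES = [
    ("lemma:diameter", "rdf:type", "lila:Lemma"),
    ("lemma:diameter", "ontolex:writtenRep", "diameter"),
    ("lemma:diameter", "ontolex:writtenRep", "diametros"),
    ("token:42", "lila:hasLemma", "lemma:diameter"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# All written representations recorded for one lemma node.
written = sorted(
    o for _, _, o in match(TRIPLES, s="lemma:diameter", p="ontolex:writtenRep")
)
```

A real triplestore adds indexing, named graphs and joins across several patterns, but the unit of data remains the same three-place statement.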
           The LiLa Knowledge Base is lexically-based and strikes a balance between feasibility and
       granularity: textual resources are made of (occurrences of) words, lexical resources describe
       properties of words, and NLP tools process words. Lemma is the key node type in LiLa. A
       Lemma is an (inflected) Form conventionally chosen as the citation form of a lexical item.
       Lemmas occur in Lexical Resources as canonical forms of lexical entries. Forms, too, can
       occur in lexical resources, for example in a lexicon containing all of the forms of a language
       (e.g., [13]). The occurrences of Forms in real texts are Tokens, which are provided
       by Textual Resources. Texts in Textual Resources can be different editions or versions of
       the same work (e.g., the numerous editions of the Orator by Cicero, which may be available
       from different Textual Resources). Finally, NLP tools process either Forms, regardless of
       their contextual use (e.g., a morphological analyzer), or Tokens (e.g., a PoS-tagger).
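
The node types described above can be sketched as a small set of Python classes. This is only an illustration of the relations between Form, Lemma and Token; the class and field names, the resource labels and the example tokens are assumptions, not the LiLa ontology itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Form:
    """A word form with one or more written representations."""
    written_reps: List[str]


@dataclass
class Lemma(Form):
    """A Form conventionally chosen as the citation form of a lexical item."""
    pos: str = ""


@dataclass
class Token:
    """An occurrence of a Form in a text provided by a Textual Resource."""
    surface: str
    resource: str
    lemma: Optional[Lemma] = None


# Illustrative data: two tokens from two hypothetical textual resources.
orator = Lemma(written_reps=["orator"], pos="NOUN")
tokens = [
    Token("oratorem", resource="edition-A", lemma=orator),
    Token("oratoris", resource="edition-B", lemma=orator),
]

# Tokens from different resources meet at the same Lemma node.
shared = all(t.lemma is orator for t in tokens)
```

Because a Lemma is modelled as a kind of Form, anything that accepts a Form (for instance a morphological analyzer) can also accept a Lemma.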



       2
           https://lila-erc.eu/
       3
           https://www.w3.org/community/ontolex/
M. C. Passarotti, F. M. Cecchini, G. Franzini, E. Litta, F. Mambrini, P. Ruffolo                        XX:3


2.1    Harmonizing Different Lemmatization Strategies
Because the lemma serves as the optimal interface between lexical resources, annotated
corpora and NLP tools, the core of the LiLa Knowledge Base is a collection of citation forms.
Interoperability can be achieved by linking the entries in lexical resources and the corpus
tokens pointing to the same lemma.
   The task of building and organizing a repository of lemmas that may serve as a hub in
such an architecture is, however, complicated by the fact that different corpora, lexica or
tools for Latin may adopt different strategies to solve the conceptual and linguistic challenges
posed by lemmatization. These include:

   - different citation forms of the same word, resulting from interchange in (a) graphical
     representation (voluptas vs. uoluptas, “satisfaction”), (b) spelling (sulphur vs. sulfur,
     “brimstone”), (c) ending (diameter vs. diametros vs. diametrus, “diameter”) or (d) the
     paradigmatic slot representing the lemma (sequor, “to follow”, first person singular of the
     passive/deponent present indicative vs. sequo, first person singular of the active present
     indicative, attested in some lexicographical sources);
   - the existence of homographic lemmas, like occido (occīdo < ob + caedo, “to strike down”)
     vs. occido (occĭdo < ob + cado, “to fall down”);
   - ambiguity in choosing the lemma: certain forms, such as participles or deadjectival
     adverbs, can be considered either part of the inflectional paradigm of verbs or adjectives,
     or independent lemmas provided with an autonomous entry in lexical resources;
   - polythematic words, for which missing forms are taken from other stems, as is the case
     for melior used as a comparative of bonus (i.q. the English “good” and “better”).

     When dealing with homographs, corpora may choose to index the different entries,
but, generally, the string of the lemma is not disambiguated. Participles can either be
lemmatized under the main verb, or have a dedicated participial lemma, which in turn
may be used systematically or only when the participle has grown into an autonomous
lexical item (e.g. doctus, “learned”, morphologically the past participle of doceo, “to teach”).
Deadjectival adverbs (e.g. aequaliter, “evenly” from aequalis, “equal”) or peculiar forms such
as comparatives (both regular and irregular) are sometimes subsumed under the (positive
degree of the) adjective, or given a self-standing lemma.
     Given the challenges and the degree of variation raised by different lemmatization strategies
for Latin, our approach is to be as descriptive and inclusive as possible: our aim is to collect
as many word forms as may be used for lemmatization and attempt to model their relations.
To do so, we rely on a series of ontologies for lexical resources to describe the word forms
used in lemmatization, and turn to the Web Ontology Language (OWL) for ontologies to
model the relations between them ([9]).
     Building on the Ontolex ontology, we define a Lemma as a Form of a word. In this way,
lexical resources compiled using the Ontolex or Lemon formalism can already be connected
to our collection. Forms have one or more written representations and are linked to one or
more Parts of Speech (PoS). PoS are linked to the appropriate OLiA concepts, and we plan
to represent the most widespread Latin PoS-tagging tagsets via dedicated OLiA ontologies.
     The relations between the lemma and the other forms of the same word are defined
horizontally, i.e. via direct relations between forms. While the architecture is ready to
accommodate all of the attested or morphologically possible inflected forms of a lexical item,
it is currently being populated only with those forms that are potentially used as lemmas,
thus shaping LiLa’s previously mentioned core.







            Figure 1 Connecting tokens and lemmas with different written representations in LiLa.


           The reference list of Latin lemmas is taken from that provided by the Latin morphological
       analyzer Lemlat [11]. Specifically, following the practice of Lemlat, we define a special subclass
       of lemmas, called “hypolemmas”, to harmonize different strategies for the lemmatization of
       participles. Hypolemmas are defined as forms of the inflectional paradigm of a word that
       may be used in annotated corpora or by NLP tools to lemmatize certain forms instead of
       the main lemma, i.e. the nominal inflected forms of verbal paradigms (participles, gerunds,
       gerundives, supines). As a result, we have generated hypolemmas for all the canonical forms
       of present, future and perfect participles and have connected them with their main (verbal)
       lemma via a subclass of the property “Form variant” of the Lemon ontology.4 Thus, for
       instance, the present participle subsistens, “taking a stand”, is a hypolemma of the main lemma
       subsisto, “to take a stand”. The same subclass is also used for alternative paradigmatic slots
       representing that lemma.
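
The effect of hypolemmas can be sketched as follows: a corpus that lemmatizes a participle under its own citation form and a corpus that lemmatizes it under the verb both end up at the same main lemma. The `LemmaNode` class, the `MAIN_LEMMA` mapping and the `resolve_main` helper are hypothetical names used for illustration only; in LiLa the link is an RDF property, not a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LemmaNode:
    """A lemma node; hypolemmas are flagged as a special subclass."""
    written_rep: str
    is_hypolemma: bool = False


subsisto = LemmaNode("subsisto")
subsistens = LemmaNode("subsistens", is_hypolemma=True)

# Hypothetical stand-in for the property linking a hypolemma to its main lemma.
MAIN_LEMMA: Dict[LemmaNode, LemmaNode] = {subsistens: subsisto}


def resolve_main(lemma: LemmaNode) -> LemmaNode:
    """Return the main lemma of a hypolemma, or the lemma itself otherwise."""
    return MAIN_LEMMA.get(lemma, lemma)
```

With such a link in place, a query for subsisto can also reach tokens that their source corpus lemmatized as subsistens.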
           Systematic graphical variations (e.g. u/v) are preprocessed automatically, whereas
       changes in spelling and ending are managed as different written representations of the same
       lemma. For instance, Figure 1 shows how the token diametrorum (provided by a textual
       resource) is connected to LiLa via the lemma. In its source text, diametrorum is assigned a
       PoS (NOUN) and a lemma (diameter). A string match is found between the string used to
       lemmatize the token and one of the three written representations of a LiLa Lemma. On the
       basis of this string match, the token diametrorum is connected to the lemma diameter via
       the relation hasLemma.
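
The linking step described above can be sketched as a lookup in an index of written representations. The sketch assumes that the automatic preprocessing of u/v collapses both letters to u; the paper does not specify the actual procedure. The lemma identifiers and function names are hypothetical.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse the u/v graphical variation (assumed rule: v becomes u)."""
    return text.lower().replace("v", "u")


# Illustrative lemma collection: identifier -> written representations.
LEMMAS: Dict[str, List[str]] = {
    "lemma:diameter": ["diameter", "diametros", "diametrus"],
    "lemma:voluptas": ["voluptas"],
}


def build_index(lemmas: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map every normalized written representation to its lemma identifiers."""
    index: Dict[str, List[str]] = {}
    for lemma_id, reps in lemmas.items():
        for rep in reps:
            index.setdefault(normalize(rep), []).append(lemma_id)
    return index


INDEX = build_index(LEMMAS)


def link(lemma_string: str) -> List[str]:
    """Return the candidate lemmas whose written representation matches."""
    return INDEX.get(normalize(lemma_string), [])
```

The function returns a list rather than a single identifier because homographic lemmas such as occido would produce more than one candidate; further information, such as the PoS assigned in the source text, would then be needed to choose among them.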

       2.2     Linguistic Resources in LiLa
       The linguistic resources currently linked in the LiLa Knowledge Base are stored in a triplestore
       using the Jena framework; the Fuseki component exposes the data as a SPARQL end-point
       accessible over HTTP. The current prototype of the LiLa RDF triplestore database connects
       the following resources: (a) the collection of lemmas provided by Lemlat, (b) the morphological
       derivation lexicon Word Formation Latin (WFL) [8], (c) the PROIEL Latin Treebank [5]
       in its Universal Dependencies (UD) version (release 2.3)5 and (d) the Index Thomisticus
       Treebank in both its UD 2.3 and original format [10].
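
Since Fuseki exposes the data through the standard SPARQL protocol, a client can send a query as an HTTP GET request with a `query` parameter. The sketch below only builds such a request with the Python standard library and does not send it. The endpoint address is the one that appears in Listing 1 and may no longer be reachable; the query string and the `build_request` helper are illustrative assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint address as given in Listing 1; it may no longer be live.
ENDPOINT = "http://lila-erc.eu:3030/lemlat/sparql"


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# A deliberately generic query, used only to show the encoding.
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
req = build_request(ENDPOINT, QUERY)
```

Passing `req` to `urllib.request.urlopen` would submit the query; the response body would then be a JSON document in the SPARQL results format, provided the server is available.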


       4
           https://www.lemon-model.net/lemon-cookbook/node17.html
       5
           http://universaldependencies.org/


    An example SPARQL query traversing all of these resources might search for all tokens (a)
whose lemma is a noun including the suffix -(t)or for nomina agentis / instrumenti (sources:
Lemlat and WFL), (b) that are assigned dependency relation nsubj (nominal subject), and
(c) that depend directly on a node of a verb in the UD tree of the sentence in which they
occur (source: PROIEL UD 2.3). The output provides the list of all noun/verb couples
resulting from the query, sorted in descending order of frequency (see Listing 1).6


        Listing 1 A SPARQL query in LiLa.
    PREFIX : <http://lila-erc.eu/data/ontologies/lemlat-base#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
    PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>

    SELECT ?headlab ?deplab (COUNT(*) AS ?tot) WHERE {
      # Lemmas formed with the suffix -(t)or (sources: Lemlat and WFL)
      SERVICE <http://lila-erc.eu:3030/lemlat/sparql> {
        ?suff a :Suffix .
        ?suff rdfs:label "-(t)or" .
        ?lemma :hasSuffix ?suff .
        ?lemma ontolex:writtenRep ?deplab . }
      ?tok :hasLemma ?lemma .
      # Tokens that are nominal subjects of a verbal head (source: PROIEL UD 2.3)
      GRAPH <http://lila-erc.eu:3030/corpora/data/la-proiel-ud> {
        ?tok conll:EDGE "nsubj" .
        ?tok conll:HEAD ?head .
        ?head conll:UPOS "VERB" . }
      ?head :hasLemma ?l .
      # Written representation of the lemma of the head verb
      SERVICE <http://lila-erc.eu:3030/lemlat/sparql> {
        ?l ontolex:writtenRep ?headlab . } }
    GROUP BY ?headlab ?deplab
    ORDER BY DESC(?tot)
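
For readers less familiar with SPARQL, the logic of the query in Listing 1 (filter, join on the syntactic head, group and count, sort by descending frequency) can be mirrored on a toy in-memory structure. The token table, the set of suffixed lemmas and the function name below are invented for illustration and do not reproduce LiLa or PROIEL data.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Toy tokens: id -> (lemma, UPOS, dependency relation, head id). Invented data.
TOKENS: Dict[int, Tuple[str, str, str, int]] = {
    1: ("creator", "NOUN", "nsubj", 2),
    2: ("facio", "VERB", "root", 0),
    3: ("creator", "NOUN", "nsubj", 4),
    4: ("facio", "VERB", "root", 0),
    5: ("doctor", "NOUN", "nsubj", 6),
    6: ("dico", "VERB", "root", 0),
    7: ("doctor", "NOUN", "obj", 6),
}

# Lemmas assumed to be recorded with the -(t)or suffix in the lexical layer.
TOR_LEMMAS: Set[str] = {"creator", "doctor"}


def noun_verb_pairs(
    tokens: Dict[int, Tuple[str, str, str, int]], suffixed: Set[str]
) -> List[Tuple[Tuple[str, str], int]]:
    """Count (head verb lemma, dependent lemma) pairs, most frequent first."""
    counts: Counter = Counter()
    for lemma, _upos, deprel, head_id in tokens.values():
        head = tokens.get(head_id)
        if lemma in suffixed and deprel == "nsubj" and head and head[1] == "VERB":
            counts[(head[0], lemma)] += 1
    return counts.most_common()


pairs = noun_verb_pairs(TOKENS, TOR_LEMMAS)
```

In the actual query the three conditions live in different resources (the suffix in the lexical layer, the dependency relation and the head in the treebank graph), and the lemma is the node that joins them.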




    3      Conclusion

In this paper, we have introduced the architecture of the LiLa Knowledge Base, which is
being built in accordance with the Linked Data paradigm to foster interoperability between
linguistic resources for Latin. In particular, we focused on the challenges introduced by the
harmonization of the different lemmatization strategies adopted by annotated corpora.
    Given the central role of the Lemma in LiLa, the project is developing a strategy to
automatically PoS tag and lemmatize the (many) corpora of Latin texts that are still free of
this level of linguistic annotation. Indeed, despite the availability of NLP tools (and trained
models) for automatic PoS tagging, lemmatization, morphological analysis and dependency
parsing, their large-scale application to Latin textual resources is severely limited by their
low degree of portability across two millennia of language change. This large diachronic and
diatopic span serves as a perfect use-case for the development, application, and testing of
solutions capable of providing equally good accuracy rates for all Latin “types”.



6
     The query can be run at https://lila-erc.eu/data/ using the /corpora endpoint.





            References
        1   Paul Buitelaar, Philipp Cimiano, John McCrae, Elena Montiel-Ponsoda, and Thierry Declerck.
            Ontology lexicalisation: The lemon perspective. In WS 2 Workshop Extended Abstracts, 9th
            International Conference on Terminology and Artificial Intelligence, pages 33–36, 2011.
        2   Christian Chiarcos and Christian Fäth. CoNLL-RDF: Linked Corpora Done in an NLP-Friendly
            Way. In Jorge Gracia, Francis Bond, John P. McCrae, Paul Buitelaar, Christian
            Chiarcos, and Sebastian Hellmann, editors, Language, Data, and Knowledge, pages 74–88,
            Cham, 2017. Springer International Publishing. URL:
            https://link.springer.com/content/pdf/10.1007%2F978-3-319-59888-8_6.pdf.
        3   Christian Chiarcos and Maria Sukhareva. OLiA - Ontologies of Linguistic Annotation. Semantic
            Web Journal, 6(4):379–386, 2015. URL:
            http://www.semantic-web-journal.net/content/olia-%E2%80%93-ontologies-linguistic-annotation.
        4   Thierry Declerck, Piroska Lendvai, Karlheinz Mörth, Gerhard Budin, and Tamás Váradi.
            Towards linked language data for digital humanities. In Linked Data in Linguistics, pages
            109–116. Springer, 2012.
        5   Dag T. T. Haug and Marius Jøhndal. Creating a Parallel Treebank of the Old Indo-European Bible
            translations. In Proceedings of the Second Workshop on Language Technology for Cultural
            Heritage Data (LaTeCH 2008), pages 27–34, 2008.
        6   Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. Integrating NLP using
            Linked Data. In 12th International Semantic Web Conference, Sydney, Australia, October
            21-25, 2013, 2013. URL: https://svn.aksw.org/papers/2013/ISWC_NIF/public.pdf.
        7   Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF) Model and Syntax
            Specification. World Wide Web Consortium, 1998.
        8   Eleonora Litta, Marco Passarotti, and Chris Culy. Formatio formosa est. Building a Word
            Formation Lexicon for Latin. In Proceedings of the Third Italian Conference on Computational
            Linguistics (CLiC–it 2016), Napoli, Italy, December 5-7, 2016, volume Vol-1749, pages 185–189,
            2016. URL: http://ceur-ws.org/Vol-1749/paper32.pdf.
        9   Deborah L. McGuinness, Frank van Harmelen, et al. OWL Web Ontology Language Overview.
            W3C Recommendation, 2004.
       10   Marco Passarotti. Theory and Practice of Corpus Annotation in the Index Thomisticus Treebank.
            Lexis, 27(A):5–23, 2009.
       11   Marco Passarotti, Marco Budassi, Eleonora Litta, and Paolo Ruffolo. The Lemlat 3.0 Package
            for Morphological Analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on
            Processing Historical Language, volume 133, pages 24–31. Linköping University Electronic Press,
            2017. URL: http://www.ep.liu.se/ecp/article.asp?issue=133&article=006&volume=.
       12   Eric Prud’hommeaux, Andy Seaborne, et al. SPARQL Query Language for RDF. W3C
            Recommendation, 2008. URL: https://www.w3.org/TR/rdf-sparql-query/ [Accessed on February 27th, 2019].
       13   Paul Tombeur. Thesaurus formarum totius Latinitatis: a Plauto usque ad saeculum XXum.
            Turnhout: Brepols, 1998.