=Paper=
{{Paper
|id=Vol-3602/paper1
|storemode=property
|title=Modelling Usage Information in a Legacy Dictionary: From TEI Lex-0 to Ontolex-Lemon
|pdfUrl=https://ceur-ws.org/Vol-3602/paper1.pdf
|volume=Vol-3602
|authors=Bruno Almeida,Rute Costa,Ana Salgado,Margarida Ramos,Laurent Romary,Fahad Khan,Sara Carvalho,Mohamed Khemakhem,Raquel Silva,Toma Tasovac
|dblpUrl=https://dblp.org/rec/conf/comhum/AlmeidaCSRRKCKS22
}}
==Modelling Usage Information in a Legacy Dictionary: From TEI Lex-0 to Ontolex-Lemon==
Bruno Almeida¹,∗,†, Rute Costa¹,†, Ana Salgado¹,²,†, Margarida Ramos¹,†, Laurent Romary³,†, Fahad Khan⁴,†, Sara Carvalho¹,⁵,†, Mohamed Khemakhem⁶,†, Raquel Silva¹,† and Toma Tasovac⁷,†

¹ NOVA CLUNL - Centro de Linguística da Universidade Nova de Lisboa, Portugal
² ACL - Academia das Ciências de Lisboa, Portugal
³ Inria - Institut national de recherche en sciences et technologies du numérique, France
⁴ ILC-CNR - Istituto di Linguistica Computazionale “Antonio Zampolli”, Italy
⁵ CLLC - Centro de Línguas, Literaturas e Culturas da Universidade de Aveiro, Portugal
⁶ ArcaScience, France
⁷ BCDH - Belgrade Center for Digital Humanities, Serbia

===Abstract===
This paper describes ongoing work in the modelling of usage information in the context of the MORDigital project. The latter is based on the encoding and publication as linked data of Diccionario da Lingua Portugueza, a Portuguese legacy dictionary authored by António de Morais Silva, whose first edition was published in 1789. In this paper, we focus on the TEI Lex-0 encoding and Ontolex-Lemon modelling of lexicographic articles from the Morais Silva dictionary that feature usage information. The approach described in this paper should be reusable for other projects involving the encoding and linked data publication of legacy dictionaries.

Keywords: legacy dictionaries, usage information, lexicography, digital humanities

COMHUM 2022: Workshop on Computational Methods in the Humanities, June 09–10, 2022, Lausanne, Switzerland

∗ Corresponding author.
† These authors contributed equally.

Email: brunoalmeida@fcsh.unl.pt (B. Almeida); rute.costa@fcsh.unl.pt (R. Costa); anasalgado@fcsh.unl.pt (A. Salgado); mvramos@fcsh.unl.pt (M. Ramos); laurent.romary@inria.fr (L. Romary); fahad.khan@ilc.cnr.it (F. Khan); sara.carvalho@ua.pt (S. Carvalho); mohamed.khemakhem@inria.fr (M. Khemakhem); raq.silva@fcsh.unl.pt (R. Silva); ttasovac@humanistika.org (T. Tasovac)

ORCID: 0000-0002-5777-5574 (B. Almeida); 0000-0002-3452-7228 (R. Costa); 0000-0002-6670-3564 (A. Salgado); 0000-0001-7209-3806 (M. Ramos); 0000-0002-0756-0508 (L. Romary); 0000-0002-1551-7438 (F. Khan); 0000-0002-7501-5405 (S. Carvalho); 0000-0003-3529-2990 (M. Khemakhem); 0000-0002-0505-4863 (R. Silva); 0000-0002-3919-993X (T. Tasovac)

© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

Bruno Almeida et al. CEUR Workshop Proceedings 5–21

===1. Introduction===
The publication of Diccionario da Lingua Portugueza in 1789, authored by António de Morais Silva, marks the beginning of contemporary Portuguese lexicography, following the model set by several modern language dictionaries published in Europe in the 17th and 18th centuries [1, 2]. As the first Portuguese monolingual dictionary, it had a fundamental role in the standardisation of this language, and constitutes a reference for studying the evolution of the Portuguese lexicon [3]. The first edition of the dictionary had two volumes (Vol. 1, 752 p. and Vol. 2, 541 p.). Morais directly oversaw the 2nd and 3rd editions (published, respectively, in 1813 and 1823). This work was greatly revised and updated over the years, culminating in the 10th edition, which was published in 12 volumes from 1949 to 1959.

The MorDigital project¹ aims at digitising and publishing in open access the first three editions of the dictionary by Morais [4]. Our methodology involves the reuse of digitised versions of the dictionary, available in the public domain as PDF files. The latter are currently undergoing a re-OCRisation process to ensure the quality of the final output of the project.
The digitised versions of the dictionary will be structured by means of several open standards for encoding and modelling lexical and dictionary data, which will facilitate interoperability with existing systems and datasets. The encoding of the dictionary’s editions will be carried out in TEI Lex-0 [5], a baseline XML encoding for machine-readable dictionaries based on the guidelines of the Text Encoding Initiative (TEI). The TEI Lex-0 encoding of the Morais Silva dictionary will be the basis for an LMF (Lexical Markup Framework) version, which should be facilitated by the ongoing convergence between TEI and the LMF standard [6].

The TEI Lex-0 encoding of the Morais Silva dictionary will be further transformed to RDF based on Ontolex-Lemon [7], a model originally developed for enriching ontologies with lexical information, which has become a de facto standard for publishing lexical resources as linked data [8]. The recently developed lexicography module, lexicog [9], facilitates the application of Ontolex to dictionary data. An XSLT-based tool, tei2ontolex², will be used for the conversion process. Examples such as those presented in this paper will serve as the basis for extending the range of features covered by this tool.

While the work shown in this paper is focussed on digital lexicography, specifically on the retrodigitisation of legacy dictionaries, technologies such as TEI and Ontolex are relevant in many other domains of the digital humanities involving text encoding, analysis, and publishing, including discourse analysis and other fields of linguistics, digital literary studies, cultural heritage, and digital archives. TEI, for instance, includes several communities of practice whose activity is centred on encoding text in a standardised, flexible, and interoperable framework. This paper could, therefore, be useful for scholars in those communities, especially for those whose work also involves applying linked data models for publishing digital humanities resources.
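The conversion step outlined in this introduction (a TEI Lex-0 entry turned into Ontolex-Lemon RDF) can be illustrated with a minimal sketch. It is written in Python rather than XSLT and is not the tei2ontolex tool. The sample entry, the base IRI `http://example.org/lexicon/`, and the names `SAMPLE_ENTRY`, `literal` and `entry_to_turtle` are hypothetical. Mapping each usage label to `ontolex:usage` with an `rdf:value` literal is an assumption made for illustration only, not necessarily the modelling proposed later in the paper.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TEI Lex-0 style entry (not taken from the Morais dictionary).
SAMPLE_ENTRY = """\
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.entry" xml:lang="pt">
  <form type="lemma"><orth>exemplo</orth></form>
  <sense xml:id="demo.entry.s1">
    <usg type="domain">Med.</usg>
    <def>Hypothetical definition text.</def>
  </sense>
</entry>
"""


def literal(text, lang):
    """Return a Turtle string literal with a language tag."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_turtle(tei_xml, base="http://example.org/lexicon/"):
    """Serialise one TEI entry as Ontolex-Lemon Turtle (illustrative mapping)."""
    entry = ET.fromstring(tei_xml)
    lang = entry.get(XML_LANG, "und")
    entry_iri = f"<{base}{entry.get(XML_ID)}>"
    lemma = entry.findtext("tei:form[@type='lemma']/tei:orth",
                           default="", namespaces=NS)
    senses = entry.findall("tei:sense", NS)
    sense_iris = [f"<{base}{s.get(XML_ID)}>" for s in senses]

    lines = [
        "@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "",
        f"{entry_iri} a ontolex:LexicalEntry ;",
    ]
    form = f"    ontolex:canonicalForm [ ontolex:writtenRep {literal(lemma, lang)} ]"
    if senses:
        lines.append(form + " ;")
        lines.append("    ontolex:sense " + ", ".join(sense_iris) + " .")
    else:
        lines.append(form + " .")

    for sense, iri in zip(senses, sense_iris):
        props = ["a ontolex:LexicalSense"]
        definition = sense.findtext("tei:def", default="", namespaces=NS).strip()
        if definition:
            props.append(f"skos:definition {literal(definition, lang)}")
        # Assumption: each <usg> label becomes an ontolex:usage blank node;
        # the @type attribute of <usg> is ignored in this sketch.
        for usg in sense.findall("tei:usg", NS):
            label = (usg.text or "").strip()
            props.append(f"ontolex:usage [ rdf:value {literal(label, lang)} ]")
        lines.append("")
        lines.append(f"{iri} " + " ;\n    ".join(props) + " .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(entry_to_turtle(SAMPLE_ENTRY))
```

The sketch keeps the label text exactly as it appears in the entry. A fuller conversion would presumably also link each label to a controlled vocabulary.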

¹ https://mordigital.fcsh.unl.pt/
² https://github.com/elexis-eu/tei2ontolex

====1.1. Related work====
Relevant work for the approach described in this paper has been carried out recently. Much of this work is focussed on relating TEI Lex-0 with Ontolex-Lemon, augmented with other ontologies, for the linked data modelling of lexicographic resources, emphasising applications in retrodigitisation projects³. The importance of the above-mentioned formats and models is made clear in the context of the ELEXIS project⁴, a European infrastructure for interoperable lexicographic resources, in which TEI Lex-0 and Ontolex-Lemon are two of the main formats for publishing and interlinking dictionary data [12].

Khan and Salgado [13] describe a novel approach to the modelling and publication of lexicographic resources as linked data. This approach consists of using Ontolex-Lemon and lexicog in conjunction with the CIDOC-CRM aligned FRBRoo ontology for representing different levels of description of lexicographic resources (i.e., work, expression, manifestation) corresponding to the different views of dictionaries explained in the TEI Guidelines [14], namely typographical (the layout of the pages), editorial (the text of the dictionary) and lexical (the conceptual and linguistic content of the dictionary). The work carried out in this paper pertains to the lexical view of the Morais dictionary, in which elements from Ontolex and lexicog are used to model usage information following an interoperable approach to that of Khan and Salgado [13] (see Section 6 of this paper).

In addition to the work described above, current research has focussed on lexicography and digital humanities, including the application of ontologies and knowledge organisation. Costa et al.
[15] show how domain labels can be modelled through an OWL ontology in the medical and health sciences (OntoDomLab-Med⁵), whose classes can be applied to the semantic annotation of usage information in TEI Lex-0 encoded dictionaries. This can be done solely with TEI elements and attributes, by including the ontology class URI within the usage information element, or with the XML Linking Language (XLink), which also makes it possible to describe the role played by the ontology class URI, providing more complex domain information. In turn, Costa et al. [16] show how SKOS (Simple Knowledge Organization System), a W3C recommendation for modelling knowledge organisation schemes, can be employed in digital humanities projects for modelling linguistic/lexicographic categories represented in dictionaries’ lists of abbreviations (e.g., part of speech, grammatical gender, register). SKOS modelling, for knowledge organisation purposes, is shown to be complementary to the TEI Lex-0 encoding of dictionary articles.

Salgado et al. [17] further highlight the importance of domain label modelling through terminological methods, namely by structuring domain labels. The resulting taxonomies or classifications of domains can be included directly within the element of TEI-encoded dictionaries, whose categories can be linked to individual dictionary articles through the TEI usage information element, while still retaining the text values that occur in the dictionary articles for human readability purposes.

³ See for example [10, 11]
⁴ https://elex.is/
⁵ https://doi.org/10.34619/emw4-ax6o

Section 5 of this paper illustrates the simpler option, as described in Costa et al. [15], for linking the TEI encoding of usage information to OntoDomLab-Med and to the MorDigital Domain Classification. The latter, described in Section 4, follows notions laid out in Costa et al.
[16] with regard to the complementarity between TEI encoding and SKOS modelling, and Salgado et al. [17] regarding the application of terminological methods in lexicography for structuring domain labels.

===2. Background===
====2.1. Usage information in lexicography====
In lexicographic theory, usage or diasystematic⁶ information is understood as a set of constraints or restrictions on the use of words, or their senses, to certain contexts or to a subset of language users (e.g., [19, 20]). Dictionaries traditionally include usage information in the lexicographical articles as labels (often abbreviated), or in more verbose forms, such as notes or as part of the lexicographic definitions themselves. As Svensén notes, diasystematic marking in lexicographic articles implies that “a certain lexical item deviates in a certain respect from the main bulk of items described in a dictionary” [20, p. 315]. This notion of deviation from the lexicon of standard varieties of languages is, therefore, the cornerstone of usage marking in dictionaries.

As noted by Salgado et al., usage labels are devices whose simplicity is only apparent, since they “often conceal the complexity of dynamic sociolinguistic, cultural, and ideological processes that they are meant to illustrate” [21, p. 134]. In a digital humanities context, such as the retrodigitisation of dictionaries, labels are also a challenge for the interoperability of lexicographic datasets from different sources, ranging over different cultures, languages, and periods. Indeed, since at least the 17th century, lexicographers have included labelling/marking in dictionaries, describing a wide variety of usage information, whose treatment in theoretical and practical lexicography has not always been consistent [22].

There have been several surveys of usage information in theoretical and practical lexicography, such as Ptaszyński [22] and Vrbinc and Vrbinc [23]. A more recent and comprehensive survey has been carried out by Salgado et al.
[21]. These studies note the difficulties caused by the variety, and partial overlapping, of usage information types put forward by lexicographers, and the different terminology employed by them. Hausmann [24] put forward the most comprehensive classification, with 11 types of usage information. This classification was later adopted by Bergenholtz and Tarp [25] and Svensén [20]. Landau [19], whose manual was first published in the same year as Hausmann’s proposal, distinguished 9 types of usage information. In a later study, Milroy and Milroy [26] included 5 types of usage information, distinguishing ‘group labels’, which pertain to a subset of language users, from ‘register labels’, pertaining to specific social and communicative contexts. Jackson [27] put forward a classification including 7 types of usage information. Atkins and Rundell [28] considered 9 types of usage information, which they call ‘linguistic labels’.

⁶ The term ‘diasystem’, originating from dialectology [18], designates a general language system encompassing several dialects. In the context of lexicography, the term ‘diasystematic’ applies to the marking of lexical items whose usage deviates from that of the lexicon of the standard variety of a language.

{| class="wikitable"
|+ Table 1: Usage information typologies in lexicographic literature
! Hausmann (1989), adopted by others !! Milroy, J. and Milroy, L. (1990) !! Landau (2001) !! Jackson (2002) !! Atkins and Rundell (2008) !! TEI Lex-0 usage type !! Criterion !! Examples
|-
| diachronic || temporal || currency/temporality || history || time || temporal || Time || archaic, old
|-
| diatopic || geographical || regional/geographical variation || dialect || region/dialect || geographic || Place || AmE., dial.
|-
| diaintegrative || – || – || – || – || hint || Nationality || Latin, English
|-
| diamedial || – || style, functional variety/register || – || – || hint || Medium || spoken
|-
| diastratic || – || restricted or taboo sexual scatological usage and slang || status || slang and jargon/offensive terms || socioCultural || Sociocultural || slang, vulgar, formal
|-
| diaphasic || register || style, functional variety/register || formality || register || socioCultural || Formality || slang, vulgar, formal
|-
| diatextual || – || style, functional variety/register || – || style || textType || Text type || bibl., poet., admin., journalese
|-
| diatechnical || field || technical or specialised terminology || topic or field || domains || domain || Speciality || Med., Biol., Phys.
|-
| diafrequential || frequency || – || – || – || frequency || Frequency || rare, occas.
|-
| diaevaluative || – || insult/style, functional variety/register || effect || attitude || attitude || Attitude || derog., euph.
|-
| dianormative || – || status or cultural level || disputed usage || – || normativity || Normativity || non-standard, incorrect
|-
| – || – || – || – || – || meaningType || Meaning || fig., lit.
|}

Table 1 shows a systematisation of the usage types considered in the above-mentioned classifications, based on the survey carried out by Salgado et al. [21].

====2.2. TEI P5 and TEI Lex-0====
The Text Encoding Initiative, or TEI, is an international organisation with a long history in the development of guidelines, and associated schemas, for encoding machine-readable text in social sciences and humanities. The current release of the guidelines, TEI P5, includes a module (i.e., a set of XML elements and attributes) for encoding dictionaries and other lexical resources, e.g., glossaries and word lists included in other documents [14, 9: Dictionaries]. The common characteristic of these resources is that they consist of entries/articles describing lexical items in a language (or languages). Since the characteristics of these resources may vary widely, TEI P5 includes the entry element for encoding conventional dictionary articles, the entryFree element for encoding unstructured entries in generic lexical resources, and the superEntry element for grouping several lexical entries.
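The three TEI P5 entry-level elements mentioned above (`entry`, `entryFree` and `superEntry`) can be told apart programmatically. The following is a minimal sketch, assuming a hypothetical, deliberately incomplete TEI fragment (`SAMPLE_TEI`, which omits the TEI header) and a hypothetical helper `count_entry_elements`. It uses only the Python standard library and illustrates the distinction only; it does not reflect the project's actual data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment: two entries grouped in a superEntry,
# one stand-alone structured entry, and one unstructured entryFree.
SAMPLE_TEI = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <superEntry>
    <entry xml:id="a.1"><form type="lemma"><orth>a</orth></form></entry>
    <entry xml:id="a.2"><form type="lemma"><orth>a</orth></form></entry>
  </superEntry>
  <entry xml:id="b"><form type="lemma"><orth>b</orth></form></entry>
  <entryFree>c, unstructured text of a dictionary article</entryFree>
</body></text></TEI>
"""


def count_entry_elements(tei_xml):
    """Count entry, entryFree and superEntry elements inside the text body."""
    root = ET.fromstring(tei_xml)
    body = root.find(f"{TEI}text/{TEI}body")
    counts = Counter()
    for element in body.iter():
        local_name = element.tag.split("}")[-1]  # drop the namespace part
        if local_name in ("entry", "entryFree", "superEntry"):
            counts[local_name] += 1
    return counts


if __name__ == "__main__":
    for name, total in sorted(count_entry_elements(SAMPLE_TEI).items()):
        print(name, total)
```

A count of this kind is one simple way to check how much of a retrodigitised dictionary has already been given structured entries.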

In a straightforward TEI encoding of print dictionaries, the main body of text (encoded through the body element) should include a set of entry elements, corresponding to the dictionary’s articles. Among other possibilities, each entry element may contain:
• Information about written and/or spoken forms of the headword (through the