From TBX to Ontolex-Lemon: Issues and Desiderata

Andrea Bellandi1,†, Giorgio Maria Di Nunzio2,†, Silvia Piccini1,† and Federica Vezzani3,†

1 Istituto di Linguistica Computazionale “A. Zampolli”, Area della Ricerca CNR di Pisa, Via Giuseppe Moruzzi, 1, 56124 Pisa, Italy
2 Department of Information Engineering, University of Padova, Via Gradenigo 6/b, 35131 Padova, Italy
3 Department of Linguistic and Literary Studies, University of Padova, Via Elisabetta Vendramini, 13 35137 Padova, Italy

Abstract
The terminology community has shown a growing interest in the formats for designing and implementing multilingual terminological resources in order to favor their interoperability and reuse. In this paper, we focus specifically on two data models: TBX and Ontolex-lemon. In particular, we aim to undertake an initial requirement analysis in order to build a converter for the latest versions of these two formats. We will focus on the theoretical and implementational implications that a transition from a concept-oriented structure (such as that of TBX) to a sense-centered organisation (such as that of Ontolex-Lemon) entails.

Keywords
TermBase eXchange, Ontolex-lemon, terminological resources, Linked Open Data

1. Introduction
In recent years, Linked Data (henceforth LD) has been shown to be one of the most promising approaches to descriptive and summarizing metadata for representing and connecting research results [1]. In this respect, the terminology community has shown growing interest in publishing resources as LD to avoid a silos-based approach in information management and ensure the interoperability and the reuse of terminological datasets, according to the FAIR policies [2].
Consequently, besides the ISO standard 30042:2019 TBX (TermBase eXchange)1, an XML-based family of terminology exchange formats compliant with the Terminological Markup Framework (TMF, ISO 16642:2017)2, the data model Ontolex-Lemon [3] is gaining ground among terminologists as the de facto standard for the representation of lexical data on the Semantic Web as LD (inter al. see [4, 5, 6]).3 Numerous methods and approaches have been proposed to convert terminological data from TBX to Ontolex-Lemon ([8, 9]), so that they can become part of the linguistic Linked Data ecosystem ([10]). Guidelines have been developed as part of the LIDER project to map these two data models, and the new paradigm Terme-à-LLOD ([11]), based on a virtualization approach, has been proposed to transform and publish standardised terminological resources as LD. In addition, a recent initiative (https://www.w3.org/community/ontolex/wiki/Terminology) foresees providing Ontolex-Lemon with a module specifically dedicated to terminology, to represent the information usually contained in traditional terminological resources and thesauri. A further step towards the realization of an ecosystem of interlinked terminological datasets is the work by [8], where the authors propose a conversion system, TBX2RDF, based on the Ontolex-Lemon model, as well as a series of best practices to transform terminologies from TBX into the LD format.

Nevertheless, it is worth emphasizing that converting a TBX format to an LD structure is not merely a question of switching from an XML-based to an RDF-based data structure. Any conversion inevitably involves a change of perspective and thus calls for theoretical reflection, as [12] pointed out. In the light of this, the aim of this paper is to undertake an initial analysis of the requirements that a converter should satisfy, with a particular focus on the theoretical implications that a transition from a concept-oriented structure (such as that of TBX) to a sense-centered organisation (such as that of Ontolex-Lemon) entails.

The ultimate goal is to build a new converter that will have two fundamental characteristics. First of all, unlike TBX2RDF, which was designed to handle input files in the older version of TBX (ISO 30042:2008)4 and to return output files in the former version of lemon ([13]), the new converter will work with the updated versions of the two data models (ISO 30042:2019 and Ontolex-Lemon). Secondly, the converter will be conceived as an interactive tool, where the terminologist will take an active role in the conversion process, making decisions regarding complex aspects such as variation, polysemy, sense-concept relation, etc.

2nd International Conference on "Multilingual digital terminology today. Design, representation formats and management systems" (MDTT) 2023, June 29–30, 2023, Lisbon, Portugal
† These authors contributed equally.
Email: andrea.bellandi@ilc.cnr.it (A. Bellandi); dinunzio@dei.unipd.it (G. M. Di Nunzio); silvia.piccini@ilc.cnr.it (S. Piccini); dinunzio@dei.unipd.it (F. Vezzani)
URL: https://www.dei.unipd.it/~dinunzio/ (G. M. Di Nunzio); https://www.dei.unipd.it/~vezzanif/ (F. Vezzani)
ORCID: 0000-0002-1900-5616 (A. Bellandi); 0000-0001-7116-9338 (G. M. Di Nunzio); 0000-0002-2584-0191 (S. Piccini); 0000-0003-2240-6127 (F. Vezzani)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.
1 https://www.iso.org/standard/62510.html
2 https://www.iso.org/standard/56063.html
3 Another data model widely used in the world of the Semantic Web for sharing and publishing lexical information as LD is the Simple Knowledge Organization System (SKOS). The latter features the same concept-centric structure as TBX (for a comparison between TBX and SKOS see [7]).
4 https://www.iso.org/standard/45797.html

2. Testing the TBX2RDF converter: analysis of TBX fragments
Based on the idea of transforming terminological resources from TBX to RDF, we first experimented with the TBX2RDF online converter service.5 As previously mentioned, the converter works with the older version of the TBX data model described in the ISO 30042 standard dated 2008 (1st edition).6 For this reason, we submitted as input a set of three TBX instances structured according to the obsolete data model. Figures 1, 2 and 3 (see Appendix) show the three fragments we used to test the converter. For each of these instances, we have also formulated the respective TBX data model structured according to the current version of the ISO 30042 standard dated 2019 (2nd edition).7 Focusing on the terminological data collected in the instances:
• Example A (figure 1) shows the representation of two concepts (C1 and C2), both verbalized in three languages (English, French and Italian). For each language there are one or more terms that designate the concept in question. For example, the English terms "goods vehicle" and "utility vehicle" are represented as synonyms since they designate the same concept C1. Other minimum information for the representation of the terminological entry has been added, for example 1) the data category /date/ indicates the time reference relating to the moment of creation of the entry; 2) the data category /note/, with reference (in both cases) to the French terms designating the two concepts, provides additional textual information about the use of the terms.
• Example B (figure 2) shows the representation of a concept (C1) verbalized again in the three working languages: English, French and Italian. As in the previous case, each language registers one or more equivalent terms designating the concept. Compared to the previous example, this instance is richer in terms of the terminological data provided. The concept C1 is accompanied, for example, by the data category related to the /subject field/ of analysis: in this case e-mobility. Furthermore, for each term, we provide data categories related to /part of speech/, /administrative status/, and the /external source/. In this case too, term notes provide additional information relating, for instance, to the phenomenon of diatopic variation: see, for example, the case of the term "car sharing", reported as a terminological variant in AU, NZ, CA, TH, and US.
• Example C (figure 3) is the richest in terms of data categories entered to describe the concept, language, and term elements. The entry illustrates the representation of two concepts C1 and C2, the latter being collapsed for space reasons. The TBX fragment shows the concept C1 verbalized in English and designated by two equivalent terms: "neighborhood electric vehicle" and the acronym "NEV". In addition to the data categories described in the previous examples, this instance contains 1) information related to /transaction type/ and /responsibility/ at the concept level; 2) the /definition/ of the concept expressed in natural language and the /external source/ of the definition, at the language level; 3) the data category related to the /term type/, which specifies whether the term is a full form or an acronym, at the term level.

5 http://tbx2rdf.lider-project.eu/converter/index.html
6 As specified in ISO 30042 (2008, vi): "This version of TBX is an update of a version that was published by the Localization Industry Standards Association (LISA) in 2002".
7 At the following anonymized repository, we provide i) the three complete instances of examples A, B and C illustrated in figures 1, 2 and 3; and ii) the same three examples modeled after the 2019 version of TBX according to the TBX-Core, TBX-Min and TBX-Basic public dialects: https://anonymous.4open.science/r/MDTT2023-04D7/. For a detailed description of the three public dialects and the data category style adopted (Data Category as Tag - DCT), please refer to: https://www.tbxinfo.net/tbx-dialects/ and https://www.tbxinfo.net/dca-v-dct/. The instances have been manually reformulated but a conversion service is offered on the TBX.info website: https://www.tbxinfo.net/tbx-updater/.

3. What the New Converter Should Look Like
As emerged from the first experiments conducted with the TBX files described above, the converter TBX2RDF uses an ad hoc vocabulary (with namespace tbx:), created by the University of Bielefeld and the Polytechnic University of Madrid as part of the LIDER project, to map some TBX elements to Ontolex-Lemon. Specifically,
1. new entities are introduced to model TBX elements that have no counterpart in OntoLex-Lemon, such as tbx:adminInfo and tbx:administrativeStatus, referring respectively to TBX tags and ;
2. the lexico-semantic relations (for example lexinfo:antonym, lexinfo:meronym)8, which in OntoLex-Lemon link lexical senses, are redefined as relations between concepts (for example tbx:antonymConcept, tbx:broaderConceptPartitive);
3. two new classes are introduced: the class tbx:TerminologicalConcept, representing a language-independent concept denoted by the term and defined as a skos:Concept; and the class tbx:Term, defined as a subclass of ontolex:LexicalEntry, representing a term as the language-specific realization of a tbx:TerminologicalConcept.
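To make the vocabulary listed above more concrete, the following is a minimal sketch of how a single term and its concept could be written as triples using these classes together with core OntoLex-Lemon properties. It is not the output of TBX2RDF: the function name `term_to_triples`, the helper `slug`, and the identifier scheme (`:goods_vehicle_en_lex` and so on) are assumptions introduced only for illustration, and triples are represented as plain string tuples rather than through an RDF library.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def slug(text: str) -> str:
    """Turn a term into a local identifier (hypothetical naming scheme)."""
    return "_".join(text.lower().split())


def term_to_triples(term: str, lang: str, concept_id: str) -> List[Triple]:
    """Describe one term as lexical entry -> sense -> concept.

    The identifiers are illustrative assumptions; only the class and
    property names come from the tbx: vocabulary and OntoLex-Lemon.
    """
    entry = f":{slug(term)}_{lang}_lex"
    form = f"{entry}_form"
    sense = f"{entry}_sense"
    concept = f":{concept_id}"
    return [
        # The language-independent concept.
        (concept, "rdf:type", "tbx:TerminologicalConcept"),
        # The term as a lexical entry with its canonical form.
        (entry, "rdf:type", "ontolex:LexicalEntry"),
        (entry, "ontolex:canonicalForm", form),
        (form, "ontolex:writtenRep", f'"{term}"@{lang}'),
        # The lexical sense that mediates between entry and concept.
        (entry, "ontolex:sense", sense),
        (sense, "ontolex:reference", concept),
    ]


triples = term_to_triples("goods vehicle", "en", "C1")
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

The term "goods vehicle" and the concept label C1 are taken from Example A; everything else in the sketch is hypothetical.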
It is worth underlining that the tbx:Term class does not seem to be used by the converter, as each term T is translated into a lexical entry having T as canonical form, and a lexical sense, which is linked to the concept defined as tbx:TerminologicalConcept. Other entities were redefined, even if already present in other vocabularies used by Ontolex-Lemon: for example, the grammatical categories already defined in lexinfo, such as lexinfo:noun, lexinfo:verb, etc., are redefined as tbx:noun, tbx:verb, etc.

Some choices made by [8] may not satisfy all terminologists, who may feel the need to reflect on and choose how to convert data from one format to another. We report below some considerations in this regard and some proposals for the future converter:
• lexicographical vs. terminological view. As previously underlined, conversion consists in a change of perspective: a purely terminological vision (TBX) is transformed into a lexicographic standpoint (Ontolex-Lemon), where the conceptual dimension is not considered and, conversely, sense acquires a central role. According to the core of the model (Ontolex), indeed, each lexical entry, be it an affix, a word or a multiword, is an instance of the class Lexical Entry and is associated with its morphological realizations (class Form) through the relation ontolex:lexicalForm as well as with its sense(s) (class Lexical Sense) by means of the ontolex:sense property. Lexical sense is therefore conceived as the reification of the ontolex:denotes property linking the lexical entry and the ontological concept, and thus additional properties can be specified, such as context, register, or domain. The concept, on the other hand, is viewed as an extralinguistic entity designated by a sense, and thus formalised in an ontology external to the model.
Sense and concept represent two distinct entities, in line with those who believe that flattening the sense on the concept means forgetting the linguistic dimension that the term, as a lexical unit, possesses.9 Not all terminologists might agree with this view. To satisfy a more traditional view of terminology, the Ontolex-Lemon model provides for the possibility of bypassing the lexical sense class and directly linking the lexical entry to the concept it denotes. In the converter that we intend to create, it is up to terminologists to choose the approach they deem most suitable, also according to the task in which the converted resource will be exploited. Concerning the concepts, OntoLex-Lemon considers them as extralinguistic entities, and they can be represented by predicates that have denotational semantics in some formal logical system. This means that the model is agnostic w.r.t. the specific ontology language used for representing concepts. Forcing the usage of SKOS as a knowledge organization framework would therefore be too restrictive. We propose a converter that allows the terminologist to decide whether to use SKOS or OWL, or to map the concepts to a given ontology.
8 The OntoLex-Lemon model recommends the use of the ontology LexInfo (https://lexinfo.net/), which serves as a linguistic category registry. TBX2RDF uses an out-of-date version of LexInfo.
9 Cfr. [14, p. 55]: "le concept est le signifié d’un mot dont on décide de négliger la dimension linguistique" ("the concept is the signified of a word whose linguistic dimension one decides to neglect"). See inter al. [15, 16].
• ontology reuse. The LD paradigm strongly encourages the reuse of existing vocabularies. According to this principle, the converter should make it possible to decide which data categories to use. As said above, many linguistic categories defined in the lexinfo ontology have been redefined by the tbx: vocabulary. Referring to figure 2, for example, the part of speech of the term "car sharing" is converted by the triple «:car_sharing_lex, tbx:partOfSpeech, tbx:noun».
Some terminologists may wish to take part in this decision.10
• deductive rules. The structure of the TBX file has some implicit relations among terms that are lost in the conversion from TBX to OntoLex-Lemon. The most important one is the information about synonymy among terms. Terms that are described in the same termEntry in a LangSet are synonyms in that language. This type of relation is not captured by TBX2RDF. The terms "store" and "storage" in figure 1, for example, have been converted as two different lexical entries whose senses refer to the same concept C2, but no synonym relation is explicitly stated. Another important relation, especially in a multilingual resource, is the equivalence of a term in different languages. In TBX, terms that are in different language sections grouped together in the same concept entry are equivalent, as shown, for example, in figure 2 with the terms "car sharing"@en and "autopartage"@fr. TBX2RDF does not represent any equivalence relation between these terms. Conversely, the new converter could suggest the use of the relations vartrans:translatableAs or lexinfo:translation in order to explicitly link the equivalent terms.
• knowledge extraction. When TBX is chosen to describe terminological data, the work of the researcher is immediately constrained by the selected dialect. Each dialect has its own set of data categories to describe the entry for each concept; a larger set means a finer granularity in the description of the properties of each term (and concept). Nevertheless, there are situations where, given a TBX dialect, the terminographer does not have a specific data category at their disposal to describe a particular behavior of the term. In those cases, the terminographer has only one choice available: to use the «note» field to store that information. TBX2RDF converts these important pieces of information as annotation properties, such as rdfs:comment or rdfs:label.
In our opinion, a converter should try to process the unstructured text with automatic methods to identify structured semantic information ([17]), and store it in the appropriate relation(s) in OntoLex-Lemon.
• enriching the TBX. A subsequent step, after the knowledge extraction from unstructured notes, could be the enrichment of the original TBX with the newly extracted information. In fact, if we are able to complement the semantic relationships that are stored (implicitly or explicitly) in the TBX file and produce a richer version of the original term record in OntoLex-Lemon during the conversion process, then it would also be possible to give this enhanced version as feedback to expand the TBX structure. This is, in theory, a viable alternative; in practice, modifying the TBX structure in the latest ISO standard is easier if we use data categories already defined in DatCatInfo (https://datcatinfo.termweb.eu/). The introduction of a new data category would indeed require a somewhat more elaborate solution, so as to document the new category in the way required by the ISO standard itself.
10 TBX2RDF seems to support this feature by allowing one to provide, as input, a list of mappings between TBX elements and the related linked data entities.

In light of the considerations made above, we propose to create a new interactive and configurable converter involving the terminologist during the conversion process. We are going to present all the design and implementation details as well as the first prototype at the conference.

4. Acknowledgments
This work has been carried out in the framework of the agreement between Consiglio Nazionale delle Ricerche – Istituto di Linguistica Computazionale and RUT Foundation. This work is also part of the initiatives carried out by the Center for Studies in Computational Terminology (CENTRICO) of the University of Padua and in the research directions of the Italian Common Language Resources and Technology Infrastructure CLARIN-IT.
References
[1] J. Frey, S. Hellmann, FAIR Linked Data - Towards a Linked Data Backbone for Users and Machines, in: Companion Proceedings of the Web Conference 2021, WWW ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 431–435. URL: https://doi.org/10.1145/3442442.3451364. doi:10.1145/3442442.3451364.
[2] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016) 160018, Nature Publishing Group. URL: https://www.nature.com/articles/sdata201618. doi:10.1038/sdata.2016.18.
[3] J. McCrae, J. Bosque-Gil, J. Gracia, P. Buitelaar, P. Cimiano, The OntoLex-Lemon Model: Development and Applications, in: I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek, V. Baisa (Eds.), Electronic lexicography in the 21st century: Lexicography from scratch. Proceedings of eLex 2017, Lexical Computing CZ s.r.o., Brno, 2017, pp. 587–597. URL: https://elex.link/elex2017/wp-content/uploads/2017/09/paper36.pdf.
[4] J. Bosque-Gil, J. Gracia, G. Aguado-de Cea, E. Montiel-Ponsoda, Applying the OntoLex Model to a Multilingual Terminological Resource, in: F. Gandon, C. Guéret, S. Villata, J. Breslin, C. Faron-Zucker, A. Zimmermann (Eds.), The Semantic Web: ESWC 2015 Satellite Events, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2015, pp. 283–294. doi:10.1007/978-3-319-25639-9_43.
[5] V. Rodriguez-Doncel, C. Santos, P. Casanovas, A. Gómez-Pérez, J. Gracia, A Linked Data Terminology for Copyright Based on Ontolex-Lemon, in: U. Pagallo, M. Palmirani, P. Casanovas, G. Sartor, S. Villata (Eds.), AI Approaches to the Complexity of Legal Systems, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2018, pp. 410–423. doi:10.1007/978-3-030-00178-0_28.
[6] D. Vellutino, R. Maslias, F. Rossi, C. Mangiacapre, M. P. Montoro, Verso l’interoperabilità semantica di IATE. Studio preliminare sul lessico dei Fondi strutturali e d’Investimento Europei (Fondi SIE) (????). URL: https://www.diacronia.ro/en/indexing/details/A23887.
[7] D. Reineke, L. Romary, Bridging the gap between SKOS and TBX, edition - Die Fachzeitschrift für Terminologie 19 (2019), Deutscher Terminologie-Tag e.V. (DTT). URL: https://hal.inria.fr/hal-02398820.
[8] P. Cimiano, J. P. McCrae, V. Rodríguez-Doncel, T. Gornostay, A. Gómez-Pérez, B. Siemoneit, A. Lagzdins, Linked terminologies: applying linked data principles to terminological resources, in: Proceedings of the eLex 2015 Conference, 2015, pp. 504–517.
[9] E. Montiel-Ponsoda, J. Bosque-Gil, J. Gracia, G. A. de Cea, D. Vila-Suero, Towards the Integration of Multilingual Terminologies: an Example of a Linked Data Prototype, in: Terminology and Artificial Intelligence (TAI), 2015, pp. 205–206.
[10] C. Chiarcos, J. McCrae, P. Cimiano, C. Fellbaum, Towards Open Data for Linguistics: Linguistic Linked Data, in: A. Oltramari, P. Vossen, L. Qin, E. Hovy (Eds.), New Trends of Research in Ontologies and Lexical Resources: Ideas, Projects, Systems, Theory and Applications of Natural Language Processing, Springer, Berlin, Heidelberg, 2013, pp. 7–25. URL: https://doi.org/10.1007/978-3-642-31782-8_2. doi:10.1007/978-3-642-31782-8_2.
[11] M. P. di Buono, P. Cimiano, M. F. Elahi, F. Grimm, Terme-à-LLOD: Simplifying the Conversion and Hosting of Terminological Resources as Linked Data, in: Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020), European Language Resources Association, Marseille, France, 2020, pp. 28–35. URL: https://aclanthology.org/2020.ldl-1.5.
[12] S. Piccini, F. Vezzani, A. Bellandi, Entre TBX et Ontolex-Lemon : Quelles Nouvelles Perspectives en Terminologie? (poster), in: G. M. D. Nunzio, G. M. Henrot, M. T. Musacchio, F. Vezzani (Eds.), Proceedings of the 1st International Conference on Multilingual Digital Terminology Today, volume 3161 of CEUR Workshop Proceedings, CEUR, Padua, Italy, 2022. URL: https://ceur-ws.org/Vol-3161/#poster10. ISSN: 1613-0073.
[13] J. McCrae, G. Aguado-de Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, T. Wunner, Interchanging lexical resources on the Semantic Web, Language Resources and Evaluation 46 (2012) 701–719. URL: https://doi.org/10.1007/s10579-012-9182-3. doi:10.1007/s10579-012-9182-3.
[14] F. Rastier, Le terme : entre ontologie et linguistique, La banque de mots 7 (1995) 35–65. URL: http://www.revue-texto.net/Inedits/Rastier/Rastier_Terme.html.
[15] L. Depecker, Entre signe et concept : Éléments de terminologie générale, Sciences du langage, Presses Sorbonne Nouvelle, Paris, 2019. URL: http://books.openedition.org/psn/3388.
[16] M. Diki-Kidiri, Le signifié et le concept dans la dénomination, Meta : journal des traducteurs / Meta: Translators’ Journal 44 (1999) 573–581, Les Presses de l’Université de Montréal. URL: https://www.erudit.org/en/journals/meta/1999-v44-n4-meta165/002566ar/. doi:10.7202/002566ar.
[17] D. B. Claro, M. Souza, C. Castellã Xavier, L. Oliveira, Multilingual Open Information Extraction: Challenges and Opportunities, Information 10 (2019) 228, Multidisciplinary Digital Publishing Institute. URL: https://www.mdpi.com/2078-2489/10/7/228. doi:10.3390/info10070228.

A. TBX Examples
In this Appendix, we show portions of TBX instances that are available on GitHub.11
11 https://anonymous.4open.science/r/MDTT2023-04D7/
Figure 1: Example A of a TBX instance
Figure 2: Example B of a TBX instance
Figure 3: Example C of a TBX instance
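As a complement to the fragments in this Appendix, the deductive rules discussed in Section 3 can be sketched in a few lines of Python. The sketch assumes a heavily simplified, namespace-free fragment that uses the ISO 30042:2019 element names (conceptEntry, langSec, termSec, term); it does not reproduce the figures. The entry identifiers and the function name `implicit_relations` are hypothetical, and only the term pairs "store"/"storage" and "car sharing"/"autopartage" come from the examples discussed in the paper.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, illustrative fragment (not a complete, valid TBX file).
TBX = """
<body>
  <conceptEntry id="c_store">
    <langSec xml:lang="en">
      <termSec><term>store</term></termSec>
      <termSec><term>storage</term></termSec>
    </langSec>
  </conceptEntry>
  <conceptEntry id="c_carsharing">
    <langSec xml:lang="en"><termSec><term>car sharing</term></termSec></langSec>
    <langSec xml:lang="fr"><termSec><term>autopartage</term></termSec></langSec>
  </conceptEntry>
</body>
"""


def implicit_relations(tbx_xml):
    """Derive relations that are implicit in the entry structure.

    Terms in the same language section of a concept entry are treated
    as synonyms; terms in different language sections of the same
    entry are treated as translation equivalents.
    """
    synonyms, translations = [], []
    for entry in ET.fromstring(tbx_xml).iter("conceptEntry"):
        terms = [
            (sec.get(XML_LANG), term.text.strip())
            for sec in entry.iter("langSec")
            for term in sec.iter("term")
        ]
        for (lang1, t1), (lang2, t2) in combinations(terms, 2):
            if lang1 == lang2:
                synonyms.append((t1, t2, lang1))
            else:
                translations.append((f"{t1}@{lang1}", f"{t2}@{lang2}"))
    return synonyms, translations


synonyms, translations = implicit_relations(TBX)
print("synonyms:", synonyms)
print("translations:", translations)
```

In a converter along the lines proposed in Section 3, each translation pair could then be offered to the terminologist as a candidate vartrans:translatableAs or lexinfo:translation link, leaving the final choice to the user.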