From TBX to Ontolex-Lemon: Issues and Desiderata
Andrea Bellandi1,† , Giorgio Maria Di Nunzio2,† , Silvia Piccini1,† and Federica Vezzani3,†
1 Istituto di Linguistica Computazionale “A. Zampolli”, Area della Ricerca CNR di Pisa, Via Giuseppe Moruzzi, 1, 56124 Pisa, Italy
2 Department of Information Engineering, University of Padova, Via Gradenigo 6/b, 35131 Padova, Italy
3 Department of Linguistic and Literary Studies, University of Padova, Via Elisabetta Vendramini, 13 35137 Padova, Italy


                                         Abstract
                                         The terminology community has shown a growing interest in the formats for designing and implementing
                                         multilingual terminological resources in order to favor their interoperability and reuse. In this paper, we
                                         focus specifically on two data models: TBX and Ontolex-lemon. In particular, we aim at undertaking an
                                         initial requirement analysis in order to build a converter for the latest versions of these two formats. We
                                         will focus on the theoretical and implementational implications that a transition from a concept-oriented
                                         structure (such as that of TBX) to a sense-centered organisation (such as that of Ontolex-Lemon) entails.

                                         Keywords
                                         TermBase eXchange, Ontolex-lemon, terminological resources, Linked Open Data


1. Introduction
In recent years, Linked Data (henceforth LD) has been shown to be one of the most promising
approaches to descriptive and summarizing metadata for representing and connecting research
results [1]. In this respect, the terminology community has shown growing interest in publishing
resources as LD to avoid a silo-based approach to information management and to ensure the
interoperability and reuse of terminological datasets, in line with the FAIR principles [2].
Consequently, besides the ISO standard 30042: 2019 TBX (TermBase eXchange)1 – an XML-based
family of terminology exchange formats compliant with the Terminological Markup
Framework (TMF - ISO 16642: 2017)2 – the data model Ontolex-Lemon [3] is gaining ground
among terminologists as the de facto standard for the representation of lexical data on
the Semantic Web as LD (inter al. see [4, 5, 6]).3 Numerous methods and approaches have

2nd International Conference on "Multilingual digital terminology today. Design, representation formats and management
systems" (MDTT) 2023, June 29–30, 2023, Lisbon, Portugal
† These authors contributed equally.
andrea.bellandi@ilc.cnr.it (A. Bellandi); dinunzio@dei.unipd.it (G. M. Di Nunzio); silvia.piccini@ilc.cnr.it
(S. Piccini); dinunzio@dei.unipd.it (F. Vezzani)
https://www.dei.unipd.it/~dinunzio/ (G. M. Di Nunzio); https://www.dei.unipd.it/~vezzanif/ (F. Vezzani)
ORCID: 0000-0002-1900-5616 (A. Bellandi); 0000-0001-7116-9338 (G. M. Di Nunzio); 0000-0002-2584-0191 (S. Piccini);
0000-0003-2240-6127 (F. Vezzani)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


1 https://www.iso.org/standard/62510.html
2 https://www.iso.org/standard/56063.html
3 Another data model widely used in the world of semantic web for sharing and publishing lexical information as LD
  is the Simple Knowledge Organization System (SKOS). The latter features the same concept-centric structure as
been proposed to convert terminological data from TBX to Ontolex-Lemon ([8, 9]), so that
they can become part of the linguistic Linked Data ecosystem ([10]). Guidelines have been
developed as part of the LIDER project to map these two data models, and the new paradigm
Terme-à-LLOD ([11]), based on a virtualization approach, has been proposed to transform and
publish standardised terminological resources as LD. In addition, a recent initiative
(https://www.w3.org/community/ontolex/wiki/Terminology) foresees providing Ontolex-Lemon with
a module specifically dedicated to terminology, to represent the information usually contained
in traditional terminological resources and thesauri. A further step towards the realization
of an ecosystem of interlinked terminological datasets is the work by [8], where the authors
propose a conversion system – TBX2RDF – based on the Ontolex-Lemon model as well as a
series of best practices to transform terminologies from TBX into the LD format. Nevertheless,
it is worth emphasizing that converting from the TBX format to an LD structure is not merely a question
of switching from an XML-based to an RDF-based data structure. Any conversion inevitably
involves a change of perspective and thus calls for theoretical reflection, as [12] pointed out. In
the light of this, the aim of this paper is to undertake an initial analysis of the requirements that a
converter should satisfy, with a particular focus on the theoretical implications that a transition
from a concept-oriented structure (such as that of TBX) to a sense-centered organisation (such
as that of Ontolex-Lemon) entails. The ultimate goal is to build a new converter that will have
two fundamental characteristics. First of all, unlike TBX2RDF, which was designed to handle input files in
the older version of TBX (ISO 30042: 2008)4 and to return output files in the former version of
lemon ([13]), the new converter will work with the updated versions of the two data models (ISO
30042: 2019 and Ontolex-Lemon). Secondly, the converter will be conceived as an interactive
tool, where the terminologist will take an active role in the conversion process, making decisions
regarding complex aspects such as variation, polysemy, sense-concept relation, etc.


2. Testing the TBX2RDF converter: analysis of TBX fragments
Based on the idea of transforming terminological resources from TBX to RDF, we first experimented
with the TBX2RDF online converter service.5 As previously mentioned, the converter
works with the older version of the TBX data model described in the ISO 30042 standard dated
2008 (1st edition).6 For this reason, we submitted as input a set of three TBX instances structured
according to this older data model. Figures 1, 2 and 3 (see Appendix) show the three fragments we
used to test the converter. For each of these instances, we have also prepared the corresponding
TBX representation structured according to the current version of the ISO 30042 standard dated
2019 (2nd edition).7 Focusing on the terminological data collected in the instances:

  TBX (for a comparison between TBX and SKOS see [7]).
4 https://www.iso.org/standard/45797.html
5 http://tbx2rdf.lider-project.eu/converter/index.html
6 As specified in ISO 30042 (2008, vi): "This version of TBX is an update of a version that was published by the
  Localization Industry Standards Association (LISA) in 2002".
7 At the following anonymized repository, we provide i) the three complete instances of examples A, B and C illustrated
  in figures 1, 2 and 3; and ii) the same three examples modeled after the 2019 version of TBX according to the TBX-Core,
  TBX-Min and TBX-Basic public dialects: https://anonymous.4open.science/r/MDTT2023-04D7/. For a detailed
  description of the three public dialects and the data category style adopted (Data Category as Tag - DCT), please refer
    • Example A (figure 1) shows the representation of two concepts (C1 and C2) both verbalized
      in three languages (English, French and Italian). For each language there are one or more
      terms that designate the concept in question. For example, the English terms "goods
      vehicle" and "utility vehicle" are represented as synonyms since they designate the same
      concept C1. Other minimum information for the representation of the terminological
      entry has been added, for example 1) the data category /date/ indicates the time reference
      relating to the moment of creation of the entry; 2) the data category /note/, with reference
      – in both cases – to the French terms designating the two concepts, provides additional
      textual information about the use of the terms.
    • Example B (figure 2) shows the representation of a concept (C1) verbalized again in the
      three working languages: English, French and Italian. As in the previous case, each
      language registers one or more equivalent terms designating the concept. Compared to
      the previous example, this instance is richer in terms of provided terminological data.
      The concept C1 is accompanied, for example, by the data category related to the /subject
      field/ of analysis: in this case e-mobility. Furthermore, for each term, we provide data
      categories related to: /part of speech/, /administrative status/, and the /external source/.
      Also in this case, term notes provide additional information relating, for instance, to the
      phenomenon of diatopic variation: see, for example, the case of the term "car sharing"
      reported as a terminological variant in AU, NZ, CA, TH, and US.
    • Example C (figure 3) is the richest in terms of data categories entered to describe the
      concept, language, and term elements. The entry illustrates the representation of two
      concepts C1 and C2, the latter being collapsed for space reasons. The TBX fragment
      shows the concept C1 verbalized in English and designated by two equivalent terms:
      "neighborhood electric vehicle" and the acronym "NEV". In addition to the data categories
      described in the previous examples, this instance contains 1) information related to
      /transaction type/ and /responsibility/ at the concept level; 2) the /definition/ of the
      concept expressed in natural language and the /external source/ of the definition, at the
      language level; 3) the data category related to the /term type/ which specifies whether
      the term is a full form or an acronym, at the term level.
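
The concept-oriented layout of these instances can be illustrated with a minimal Python sketch. It assumes a heavily reduced fragment in the 2008-style nesting (termEntry, langSet, tig, term); the fragment, the entry identifiers and the function name terms_by_concept are illustrative assumptions and are not copied from figures 1, 2 and 3. Only the terms themselves come from examples A and B. Real TBX files carry headers, namespaces and many more data categories, none of which this sketch handles.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, reduced fragment: the ids "A_C1" and "B_C1" are invented for
# illustration; the terms are those mentioned for examples A and B.
TBX_FRAGMENT = """
<body>
  <termEntry id="A_C1">
    <langSet xml:lang="en">
      <tig><term>goods vehicle</term></tig>
      <tig><term>utility vehicle</term></tig>
    </langSet>
  </termEntry>
  <termEntry id="B_C1">
    <langSet xml:lang="en">
      <tig><term>car sharing</term></tig>
    </langSet>
    <langSet xml:lang="fr">
      <tig><term>autopartage</term></tig>
    </langSet>
  </termEntry>
</body>
"""

# ElementTree expands the predeclared "xml" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def terms_by_concept(tbx_text):
    """Return {concept_id: {language: [terms]}} for a TBX-like fragment."""
    root = ET.fromstring(tbx_text)
    result = defaultdict(lambda: defaultdict(list))
    for entry in root.iter("termEntry"):
        for lang_set in entry.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            for term in lang_set.iter("term"):
                result[entry.get("id")][lang].append(term.text.strip())
    # Convert nested defaultdicts into plain dicts for easier inspection.
    return {cid: dict(langs) for cid, langs in result.items()}


concepts = terms_by_concept(TBX_FRAGMENT)
```

The nested dictionary makes the concept-first organisation explicit: every term is reached through its concept and its language section, never the other way round.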


3. What the New Converter Should Look Like
As emerged from the first experiments conducted with the TBX files described above, the
converter TBX2RDF uses an ad hoc vocabulary (with namespace tbx:), created by the University
of Bielefeld and the Polytechnic University of Madrid as part of the LIDER project to map some
TBX elements to Ontolex-Lemon. Specifically,
   1. new entities are introduced to model TBX elements that have no counterpart in OntoLex-Lemon,
      such as tbx:adminInfo and tbx:administrativeStatus, referring respectively to the TBX
      tags <adminInfo> and <administrativeStatus>;


to: https://www.tbxinfo.net/tbx-dialects/ and https://www.tbxinfo.net/dca-v-dct/. The instances have been manually
reformulated but a conversion service is offered on the TBX.info website: https://www.tbxinfo.net/tbx-updater/.
    2. the lexico-semantic relations (for example lexinfo:antonym, lexinfo:meronym),8 which
       in OntoLex-Lemon link lexical senses, are redefined as relations between concepts (for
       example tbx:antonymConcept, tbx:broaderConceptPartitive);
    3. two new classes are introduced: the class tbx:TerminologicalConcept representing a
       language-independent concept denoted by the term and defined as a skos:Concept; the
       class tbx:Term, defined as a subclass of ontolex:LexicalEntry, representing a term as the
       language-specific realization of a tbx:TerminologicalConcept. It is worth underlining that
       the tbx:Term class does not seem to be used by the converter, as each term T is translated
       into a lexical entry having T as canonical form, and a lexical sense, which is linked to the
       concept defined as tbx:TerminologicalConcept.
   Other entities were redefined even though they are already present in other vocabularies used by
Ontolex-Lemon: for example, the grammatical categories already defined in lexinfo, such as
lexinfo:noun, lexinfo:verb, etc., are redefined as tbx:noun, tbx:verb, etc.
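
The output pattern described in item 3 (one lexical entry per term, with a canonical form and a lexical sense linked to the concept) can be sketched as a list of triples. This is not the TBX2RDF implementation: the function name term_to_triples, the "_form" and "_sense" suffixes and the use of ontolex:reference for the sense-to-concept link are assumptions made for illustration. Only the "_lex" suffix and the tbx: terms follow the description given in this paper.

```python
def term_to_triples(term, lang, concept_id, pos=None, base=":"):
    """Sketch of the triples produced for one term of one concept.

    Triples are plain (subject, predicate, object) string tuples so that the
    example stays free of third-party RDF libraries.
    """
    slug = term.replace(" ", "_")
    entry = f"{base}{slug}_lex"      # naming as in the triple quoted for "car sharing"
    form = f"{base}{slug}_form"      # hypothetical identifier
    sense = f"{base}{slug}_sense"    # hypothetical identifier
    concept = f"{base}{concept_id}"
    triples = [
        (entry, "rdf:type", "ontolex:LexicalEntry"),
        (entry, "ontolex:canonicalForm", form),
        (form, "ontolex:writtenRep", f'"{term}"@{lang}'),
        (entry, "ontolex:sense", sense),
        # Assumption: the sense points to the concept via ontolex:reference.
        (sense, "ontolex:reference", concept),
        (concept, "rdf:type", "tbx:TerminologicalConcept"),
    ]
    if pos is not None:
        # tbx: re-declares categories that LexInfo already provides.
        triples.append((entry, "tbx:partOfSpeech", f"tbx:{pos}"))
    return triples


car_sharing = term_to_triples("car sharing", "en", "C1", pos="noun")
```

Note that the tbx:Term class does not appear anywhere in the generated triples, which mirrors the observation made in item 3.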
   Some choices made by [8] may not satisfy all terminologists, who may feel the need to
reflect on and choose how to convert data from one format to another. We report below some
considerations in this regard and some proposals for the future converter:

     • lexicographical vs. terminological view. As previously underlined, conversion con-
       sists in a change of perspective: a purely terminological vision (TBX) is transformed
       into a lexicographic standpoint (Ontolex-Lemon), where the conceptual dimension is not
       considered and, conversely, sense acquires a central role. According to the core of the
       model (Ontolex), indeed, each lexical entry – be it an affix, a word or a multiword – is an
       instance of the class Lexical Entry and is associated with its morphological realizations (class
       Form) through the relation ontolex:lexicalForm as well as with its sense(s) (class Lexical
       Sense) by means of the ontolex:sense property. Lexical sense is therefore conceived as the
       reification of the ontolex:denotes property linking the lexical entry and the ontological
       concept, and thus additional properties can be specified, such as context, register, or
       domain. The concept, on the other hand, is viewed as an extralinguistic entity designated
       by a sense, and thus formalised in an ontology external to the model. Sense and concept
       represent two distinct entities, in line with those who believe that flattening the sense
       on the concept means forgetting the linguistic dimension that the term, as a lexical unit,
       possesses.9 Not all terminologists might agree with this view. To satisfy a more traditional
       view of terminology, the model Ontolex-Lemon provides for the possibility of bypassing
       the lexical sense class and directly linking the lexical entry to the concept it denotes. In
       the converter that we intend to create, it is up to terminologists to choose the approach
       they deem most suitable, also according to the task in which the converted resource will
       be exploited. Concerning the concepts, OntoLex-Lemon considers them as extralinguistic
       entities, and they can be represented by predicates that have denotational semantics in
       some formal logical system. This means that the model is agnostic w.r.t. the specific
       ontology language used for representing concepts. The choice of forcing the usage of
8 The OntoLex-Lemon model recommends the use of the ontology LexInfo (https://lexinfo.net/), which serves as a
  linguistic category registry. TBX2RDF uses an out-of-date version of LexInfo.
9 Cf. [14, p. 55]: "le concept est le signifié d’un mot dont on décide de négliger la dimension linguistique" ("the concept
  is the signified of a word whose linguistic dimension one decides to disregard"). See inter al. [15, 16].
          SKOS as a knowledge organization framework may be too restrictive. We propose a
          converter that allows the terminologist to decide whether to use SKOS or OWL, or to map the
          concepts to a given ontology.
        • ontology reuse. The LD paradigm strongly encourages the reuse of existing vocabularies.
          According to this principle, the converter should make it possible to decide which data
          categories to use. As said above, many linguistic categories defined in the lexinfo ontology
          have been redefined by the tbx: vocabulary. Referring to figure 2, for example, the
          part of speech of the term "car sharing" is converted into the triple «:car_sharing_lex,
          tbx:partOfSpeech, tbx:noun». Some terminologists may wish to take part in such a
          decision.10
        • deductive rules. The structure of the TBX file encodes some implicit relations among terms
          that are lost in the conversion from TBX to OntoLex-Lemon. The most important one
          is the information about synonymy among terms. Terms that are described in the
          same langSet of a termEntry are synonyms in that language. This type of relation is
          not captured by TBX2RDF. The terms "store" and "storage" in figure 1, for example,
          have been converted as two different lexical entries whose senses refer to the same
          concept C2, but no synonym relation is explicitly stated. Another important relation,
          especially in a multilingual resource, is the equivalence of a term in different languages.
          In TBX, terms that are in different language sections that are grouped together in the
          same concept entry are equivalent, as shown, for example, in figure 2 with the terms
          "car sharing"@en and "autopartage"@fr. TBX2RDF does not represent any equivalence
          relation between these terms. Conversely, the new converter could suggest the use of
          the relations vartrans:translatableAs or lexinfo:translation in order to explicitly link the
          equivalent terms.
        • knowledge extraction. When TBX is chosen to describe terminological data, the work
          of the researcher is immediately constrained by the type of the selected dialect. Each
          dialect has its set of data categories to describe the entry for each concept; a larger set
          means a finer granularity of the description of the properties of each term (and concept).
          Nevertheless, there are situations where, given a TBX dialect, the terminographer does
          not have a specific data category at their disposal to describe a particular behavior of the term.
          In those cases, the terminographer has only one choice available: to use the «note» field
          to store that information. TBX2RDF converts these important pieces of information into
          annotation properties, such as rdfs:comment or rdfs:label. In our opinion, a converter
          should try to process the unstructured text source with automatic methods to identify
          structured semantic information ([17]), and store it in the appropriate relation(s) in
          OntoLex-Lemon.
        • enriching the TBX. A subsequent step, after the knowledge extraction from unstructured
          notes, could be the enrichment of the original TBX with the new extracted information.
          In fact, if we are able to complement the semantic relationships that are stored (implicitly
          or explicitly) in the TBX file and produce a richer version of the original term record in
          OntoLex-Lemon during the conversion process, then, it would also be possible to give this
10 TBX2RDF seems to support this feature by allowing one to provide, as input, a list of mappings between TBX
     elements and the related linked data entities.
      enhanced version as feedback to expand the TBX structure. This is, in theory, a viable
      alternative; in practice, modifying the TBX structure in the latest ISO standard is easier
      if we use data categories already defined in DatCatInfo (https://datcatinfo.termweb.eu/).
      The introduction of a new data category would indeed require a somewhat more elaborate
      solution, since the new category must be documented in the way required by the ISO standard itself.
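
The deductive rules discussed in the list above can be sketched as follows: terms sharing a language section of the same concept entry are taken as synonyms, and terms in different language sections of the same entry as equivalents. The input layout, the concept identifiers "A_C2" and "B_C1" and the function name deduce_relations are assumptions for illustration. The choice of lexinfo:synonym for the synonymy link is likewise an assumption, while vartrans:translatableAs is one of the two relations suggested above. For brevity, plain "term@lang" strings stand in for the lexical entries or senses that these properties would actually connect.

```python
from itertools import combinations


def deduce_relations(concept_entries):
    """Derive the synonym and translation pairs implied by the entry structure.

    concept_entries maps a concept id to {language: [terms]}.
    """
    synonyms, translations = [], []
    for langs in concept_entries.values():
        # Rule 1: terms in the same language section are synonyms.
        for lang, terms in langs.items():
            for a, b in combinations(terms, 2):
                synonyms.append((f"{a}@{lang}", "lexinfo:synonym", f"{b}@{lang}"))
        # Rule 2: terms in different language sections are equivalents.
        for (lang_a, terms_a), (lang_b, terms_b) in combinations(langs.items(), 2):
            for a in terms_a:
                for b in terms_b:
                    translations.append(
                        (f"{a}@{lang_a}", "vartrans:translatableAs", f"{b}@{lang_b}")
                    )
    return synonyms, translations


# Terms taken from the cases discussed above; the concept ids are hypothetical.
entries = {
    "A_C2": {"en": ["store", "storage"]},
    "B_C1": {"en": ["car sharing"], "fr": ["autopartage"]},
}
synonyms, translations = deduce_relations(entries)
```

In an interactive converter these derived pairs would be proposed to the terminologist as suggestions rather than asserted automatically, since not every pair of co-listed terms is necessarily interchangeable in every context.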

  In light of the considerations made above, we propose to create a new interactive and
configurable converter involving the terminologist during the conversion process. We will
present the design and implementation details as well as the first prototype at the
conference.
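
As a rough indication of what "configurable" could mean in practice, the following sketch gathers three of the choices discussed in this section into a single settings object. It is not the announced prototype. The class ConversionChoices, its field names and the two helper functions are hypothetical, and typing a concept as skos:Concept or owl:Class is only one possible reading of the SKOS/OWL choice. Triples are again plain string tuples.

```python
from dataclasses import dataclass


@dataclass
class ConversionChoices:
    """Hypothetical settings a terminologist could fix before converting."""
    use_lexical_sense: bool = True        # sense-centered vs. direct denotation
    concept_class: str = "skos:Concept"   # e.g. "owl:Class" instead of SKOS
    pos_prefix: str = "lexinfo"           # reuse LexInfo rather than tbx:


def link_entry_to_concept(entry, concept, choices):
    """Link a lexical entry to its concept according to the chosen view."""
    triples = [(concept, "rdf:type", choices.concept_class)]
    if choices.use_lexical_sense:
        sense = f"{entry}_sense"  # hypothetical identifier scheme
        triples.append((entry, "ontolex:sense", sense))
        # Assumption: the sense points to the concept via ontolex:reference.
        triples.append((sense, "ontolex:reference", concept))
    else:
        # Bypass the lexical sense and state the denotation directly.
        triples.append((entry, "ontolex:denotes", concept))
    return triples


def part_of_speech(entry, pos, choices):
    """Express the part of speech with the vocabulary selected by the user."""
    prefix = choices.pos_prefix
    return (entry, f"{prefix}:partOfSpeech", f"{prefix}:{pos}")


traditional = ConversionChoices(use_lexical_sense=False, concept_class="owl:Class")
direct_link = link_entry_to_concept(":car_sharing_lex", ":C1", traditional)
reused_pos = part_of_speech(":car_sharing_lex", "noun", ConversionChoices())
legacy_pos = part_of_speech(":car_sharing_lex", "noun", ConversionChoices(pos_prefix="tbx"))
```

With pos_prefix set to "tbx" the helper reproduces the triple quoted earlier for "car sharing", whereas the default reuses the LexInfo categories.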


4. Acknowledgments
This work has been carried out in the framework of the agreement between Consiglio Nazionale
delle Ricerche – Istituto di Linguistica Computazionale and RUT Foundation. This work is
also part of the initiatives carried out by the Center for Studies in Computational Terminology
(CENTRICO) of the University of Padua and in the research directions of the Italian Common
Language Resources and Technology Infrastructure CLARIN-IT.


References
 [1] J. Frey, S. Hellmann, FAIR Linked Data - Towards a Linked Data Backbone for Users
     and Machines, in: Companion Proceedings of the Web Conference 2021, WWW ’21,
     Association for Computing Machinery, New York, NY, USA, 2021, pp. 431–435. URL:
     https://doi.org/10.1145/3442442.3451364. doi:10.1145/3442442.3451364.
 [2] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak,
     N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes,
     T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-
     Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft,
     T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson,
     P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater,
     G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waag-
     meester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, The FAIR Guiding Principles
     for scientific data management and stewardship, Scientific Data 3 (2016) 160018. URL:
     https://www.nature.com/articles/sdata201618. doi:10.1038/sdata.2016.18, number: 1,
     publisher: Nature Publishing Group.
 [3] J. McCrae, J. Bosque-Gil, J. Gracia, P. Buitelaar, P. Cimiano, The OntoLex-Lemon
     Model: Development and Applications, in: I. Kosem, C. Tiberius, M. Jakubíček,
     J. Kallas, S. Krek, V. Baisa (Eds.), Electronic lexicography in the 21st century: Lexicog-
     raphy from scratch. Proceedings of eLex 2017, Lexical Computing CZ s.r.o., Brno, 2017,
     pp. 587–597. URL: https://elex.link/elex2017/wp-content/uploads/2017/09/paper36.pdf.
 [4] J. Bosque-Gil, J. Gracia, G. Aguado-de Cea, E. Montiel-Ponsoda, Applying the OntoLex
     Model to a Multilingual Terminological Resource, in: F. Gandon, C. Guéret, S. Villata,
     J. Breslin, C. Faron-Zucker, A. Zimmermann (Eds.), The Semantic Web: ESWC 2015 Satellite
     Events, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2015,
     pp. 283–294. doi:10.1007/978-3-319-25639-9_43.
 [5] V. Rodriguez-Doncel, C. Santos, P. Casanovas, A. Gómez-Pérez, J. Gracia, A Linked
     Data Terminology for Copyright Based on Ontolex-Lemon, in: U. Pagallo, M. Palmirani,
     P. Casanovas, G. Sartor, S. Villata (Eds.), AI Approaches to the Complexity of Legal Systems,
     Lecture Notes in Computer Science, Springer International Publishing, Cham, 2018, pp.
     410–423. doi:10.1007/978-3-030-00178-0_28.
 [6] D. Vellutino, R. Maslias, F. Rossi, C. Mangiacapre, M. P. Montoro, Verso l’interoperabilità
     semantica di IATE. Studio preliminare sul lessico dei Fondi strutturali e d’Investimento
     Europei (Fondi SIE) (????). URL: https://www.diacronia.ro/en/indexing/details/A23887.
 [7] D. Reineke, L. Romary, Bridging the gap between SKOS and TBX, edition - Die
     Fachzeitschrift für Terminologie 19 (2019). URL: https://hal.inria.fr/hal-02398820,
     publisher: Deutscher Terminologie-Tag e.V. (DTT).
 [8] P. Cimiano, J. P. McCrae, V. Rodríguez-Doncel, T. Gornostay, A. Gómez-Pérez, B. Siemoneit,
     A. Lagzdins, Linked terminologies: applying linked data principles to terminological
     resources, in: Proceedings of the eLex 2015 Conference, 2015, pp. 504–517.
 [9] E. Montiel-Ponsoda, J. Bosque-Gil, J. Gracia, G. A. de Cea, D. Vila-Suero, Towards the
     Integration of Multilingual Terminologies: an Example of a Linked Data Prototype., in:
     Terminology and Artificial Intelligence (TAI), 2015, pp. 205–206.
[10] C. Chiarcos, J. McCrae, P. Cimiano, C. Fellbaum, Towards Open Data for Linguistics:
     Linguistic Linked Data, in: A. Oltramari, P. Vossen, L. Qin, E. Hovy (Eds.), New Trends
     of Research in Ontologies and Lexical Resources: Ideas, Projects, Systems, Theory and
     Applications of Natural Language Processing, Springer, Berlin, Heidelberg, 2013, pp. 7–25.
     URL: https://doi.org/10.1007/978-3-642-31782-8_2. doi:10.1007/978-3-642-31782-8_2.
[11] M. P. di Buono, P. Cimiano, M. F. Elahi, F. Grimm, Terme-à-LLOD: Simplifying the
     Conversion and Hosting of Terminological Resources as Linked Data, in: Proceedings of
     the 7th Workshop on Linked Data in Linguistics (LDL-2020), European Language Resources
     Association, Marseille, France, 2020, pp. 28–35. URL: https://aclanthology.org/2020.ldl-1.5.
[12] S. Piccini, F. Vezzani, A. Bellandi, Entre TBX et Ontolex-Lemon : Quelles Nouvelles
     Perspectives en Terminologie? (poster), in: G. M. D. Nunzio, G. M. Henrot, M. T. Musacchio,
     F. Vezzani (Eds.), Proceedings of the 1st International Conference on Multilingual Digital
     Terminology Today, volume 3161 of CEUR Workshop Proceedings, CEUR, Padua, Italy, 2022.
     URL: https://ceur-ws.org/Vol-3161/#poster10, ISSN: 1613-0073.
[13] J. McCrae, G. Aguado-de Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez,
     J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, T. Wunner, Interchanging lexical
     resources on the Semantic Web, Language Resources and Evaluation 46 (2012) 701–719.
     URL: https://doi.org/10.1007/s10579-012-9182-3. doi:10.1007/s10579-012-9182-3.
[14] F. Rastier, Le terme : entre ontologie et linguistique, La banque de mots 7 (1995) 35–65.
     URL: http://www.revue-texto.net/Inedits/Rastier/Rastier_Terme.html.
[15] L. Depecker, Entre signe et concept : Éléments de terminologie générale, Sciences du
     langage, Presses Sorbonne Nouvelle, Paris, 2019. URL: http://books.openedition.org/psn/3388.
[16] M. Diki-Kidiri, Le signifié et le concept dans la dénomination, Meta : journal des traducteurs
     / Meta: Translators’ Journal 44 (1999) 573–581. URL:
     https://www.erudit.org/en/journals/meta/1999-v44-n4-meta165/002566ar/. doi:10.7202/002566ar,
     publisher: Les Presses de l’Université de Montréal.
[17] D. B. Claro, M. Souza, C. Castellã Xavier, L. Oliveira, Multilingual Open Information
     Extraction: Challenges and Opportunities, Information 10 (2019) 228. URL:
     https://www.mdpi.com/2078-2489/10/7/228. doi:10.3390/info10070228, number: 7,
     publisher: Multidisciplinary Digital Publishing Institute.


A. TBX Examples
In this Appendix, we show portions of TBX instances that are available on GitHub.11


11 https://anonymous.4open.science/r/MDTT2023-04D7/
Figure 1: Example A of a TBX instance
Figure 2: Example B of a TBX instance
Figure 3: Example C of a TBX instance