New Terminological Approaches for New Heritages and Corpora: The ITinHeritage Project

Caroline Djambian1, Micaela Rossi2, Giada D'Ippolito3, Emrick Poncet4 and Pierre Maret5

1 Grenoble Alpes University, Lab GRESEC (Groupe de recherche sur les enjeux de la communication)
2 Genoa University, Dipartimento di Lingue e culture moderne
3 Genoa University, Dipartimento di Lingue e culture moderne
4 Saint Etienne University and Grenoble Alpes University, Lab GRESEC (Groupe de recherche sur les enjeux de la communication)
5 Saint Etienne University, Lab H. Curien

Abstract
Safeguarding Information Technology (IT) heritage is a matter of general interest. This is the aim of the ITinHeritage project, which takes a heritage-based approach to IT museums in France and around the world. The unprecedented study of this heritage leads us to understand how to perpetuate and mediate contemporary scientific and technical knowledge, of which data is the new heritage. Data also constitute the new corpora, which means that terminology work needs to be rethought. The ITinHeritage project aims to develop an innovative approach to digital humanities by creating a corpus from the collections of IT museums, organised in the form of a knowledge graph, and by using methods and tools that combine traditional linguistic approaches with explorations in the computer sciences. Big data, AI and NLP are thus being used to explore, enhance and perpetuate their own heritage.

Keywords
Terminology, Multilingualism, Lexical extraction, Automatic Language Processing (ALP), Artificial Intelligence (AI), Knowledge graph, Ontology, Linked Open Data

1. An evolving subject for study: Information Technology (IT) heritage

Technoscience, the science embodied in technology, not only makes the world more intelligible; it also transforms and impacts the world in unprecedented proportions and at unprecedented speed. Technoscience is the foundation of contemporary works of humanity. Information Technologies (IT), defined by UNESCO (2023) as "the set of tools and technological resources that enable information to be transmitted, recorded, created, shared or exchanged", are the emblem of these technologies and their specific features. But to this day, even though IT permeates and shapes our societies, this field remains poorly defined, because its rapid and massive evolution has left little time for study and heritage development.

3rd International Conference on "Multilingual digital terminology today. Design, representation formats and management systems" (MDTT) 2024, June 27-28, 2024, Granada, Spain.
caroline.djambian@univ-grenoble-alpes.fr (C. Djambian); micaela.rossi@unige.it (M. Rossi); giadadippolito30@gmail.com (G. D'Ippolito); emrick.poncet@univ-grenoble-alpes.fr (E. Poncet); pierre.maret@univ-st-etienne.fr (P. Maret)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Looking at this heritage is a crucial part of understanding today's world and how society has evolved since the second half of the twentieth century. But how can we define it, represent it, promote it and preserve it in its rapid and proliferating evolution? Our approach, which is resolutely heritage-based and interdisciplinary, takes as its starting point the IT museum spaces that have initiated this heritage enhancement.
The Musée des Arts et Métiers (MAM, France), ACONIT (Grenoble, France), the London Science Museum (London, UK), the NAM-IP (Namur, Belgium), the Museo degli strumenti per il calcolo (Pisa, Italy), the Home Computer Museum (Helmond, Netherlands) and the HNF (Paderborn, Germany) are actively participating in our expanding network. The ITinHeritage research project thus aims to conduct an epistemological reflection on the issue of IT by placing at the heart of the question of heritage the history of a societal mutation and the emergence of new knowledge in terms of representations of the world.

Our starting point is the as yet unexplored definition of this labile heritage, whose knowledge is crystallised in various forms. Firstly, through the expression of its explicit knowledge, via its physical and digital objects: the digitised artefact and its documentation, the software that represents a third of certain collections, and above all the data that forms the new collections of science. Secondly, through the expression of the tacit knowledge of IT, which has not yet been questioned or given heritage status. This tacit knowledge is expressed at the experiential level. Whereas "pure knowledge" [1] is "exoteric" [2], i.e. concretised in writing or in an object, "empirical knowledge", the empeiria or métis (Aristotle), can only be expressed by its bearer in action. Gathering and transmitting this tacit, "esoteric" knowledge is a real challenge because, to be captured, empeiria needs to be shaped into gestures and discourse. The language of specialisation is the main expression of this knowledge, i.e. of experience of the realities of the world [3].

We are therefore placing language at the heart of the ITinHeritage project by combining complementary and innovative approaches aimed at the interaction between terminology and ontology. This link is not a new theme. Sager (1990) adopts an onomasiological approach, but one based on corpora of lexical data collections, adapted to the creation of special-language vocabularies. The descriptive/socio-terminological current in France, theorised by Gaudin [4] and Boulanger [5], leads to the study of the actual use of language, the importance of external influences, and the diachronic evolution of language. This last aspect was subsequently given prominence in cognitive theory, which studies the cognitive processes involved in the use and processing of language, and whose main exponent is Geeraerts. In our case, we place ourselves in the line of Temmerman (2000) and her formulation of the unit of understanding (UoU), i.e. categories whose prototypical structures are constantly evolving, which requires us to take into account the conceptual nature of terminologies, as well as in the wake of the research of Christophe Roche and his onto-terminological approach [6].

2. Information Technologies (IT) corpora

Data are not only the new objects (collections) of heritage; they are also the new corpus. To build our corpus, we chose to model the metadata of museum collections in the form of a knowledge graph, in the Semantic Web paradigm, which promotes links between data and is easily augmented by the arrival of new data. A crucible of IT heritage, our Knowledge Graph was developed using the Wikibase environment, after collecting and harmonizing the museums' metadata (native XML, PDF and other formats, free fields) and then converting it to CSV and RDF, following FAIR principles, so that it could be opened up and linked on the web (Linked Open Data, LOD). A minimal sketch of this kind of conversion is given below.
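To make the conversion step concrete, here is a minimal sketch, under illustrative assumptions, of how one harmonized metadata record could be turned into RDF triples typed against a CIDOC-CRM class. The CSV columns, the itinheritage.example.org namespace and the choice of properties are assumptions for the example, not the project's actual schema; the CIDOC-CRM identifiers used are those published in the CRM RDFS (class naming varies slightly between CRM versions).

```python
"""Minimal sketch: turning one harmonized CSV metadata row into RDF triples.

Assumptions (illustrative, not the project's actual schema): a CSV export with
columns inventory_number, name, description; an example.org namespace for
artefact URIs; CIDOC-CRM class/property names as published in the CRM RDFS.
"""
import csv
import io

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
ITH = Namespace("https://itinheritage.example.org/artefact/")  # hypothetical namespace

SAMPLE_CSV = """inventory_number,name,description
ACO-0123,Tabulatrice Bull BS 120,"Tabulatrice électromécanique composée d'un lecteur de cartes, d'une imprimante et d'une perforatrice de cartes."
"""

def row_to_triples(graph: Graph, row: dict) -> URIRef:
    """Add one artefact record to the graph as a CIDOC-CRM typed resource."""
    artefact = ITH[row["inventory_number"]]
    graph.add((artefact, RDF.type, CRM["E22_Man-Made_Object"]))
    graph.add((artefact, RDFS.label, Literal(row["name"], lang="fr")))
    graph.add((artefact, CRM["P3_has_note"], Literal(row["description"], lang="fr")))
    graph.add((artefact, CRM["P48_has_preferred_identifier"],
               Literal(row["inventory_number"])))
    return artefact

if __name__ == "__main__":
    g = Graph()
    g.bind("crm", CRM)
    g.bind("ith", ITH)
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
        row_to_triples(g, row)
    # Serialize as Turtle, ready to be loaded into a triple store or a Wikibase import pipeline.
    print(g.serialize(format="turtle"))
```

Once serialized in this way, such records can be linked to external LOD resources and queried alongside the rest of the graph.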
The Knowledge Graph is structured by a first-level ontology based on CIDOC-CRM (the Conceptual Reference Model of the International Committee for Documentation of the International Council of Museums) and the EDM (Europeana Data Model), built in the Protégé environment.

Figure 1: Extract from the first-level ontology that structures the ITinHeritage knowledge graph.

The Knowledge Graph currently contains more than 25,500 artifacts and is set to grow. It perpetuates and opens up the IT heritage, allowing it to be exploited, and above all it offers us a choice textual corpus, highly representative of the new, Big Data-derived corpora that terminologists face today. This multilingual corpus (French, English, German and Italian) is the basis for a study of the discourses that represent IT knowledge, which we describe in detail below. With the help of experts in the field (Prof. Marie Gevers from NAM-IP and Xavier Heron from ACONIT), we are embarking on onto-terminological work through the construction of a quadrilingual (FR, EN, IT, DE; so far, the corpus has been explored for French and English) and multimedia dictionary, and an ontology of the IT field, based upon the founding models of terminological work (ISO standards 5078, 704 and 1087).

3. From terms to concepts in the Information Technology (IT) field: the appropriation of scientific and technical knowledge versus technicality and variety of languages?

If we focus on language, we must take it in all its complexity. It is true that, as Bertrand Russell notes, in an ideal language there should be only one word for a single object, and any complex object would be expressed by a combination of words, one for each characteristic of the object [7]. But Democritus already noted in antiquity that: 1) different objects are often designated by the same name; 2) the same object is often designated by different names; 3) the names designating an object may vary over time; 4) the reasons for which names are linked to objects vary greatly [8]. Thus, the same object is designated by different terms in different languages. However, "there are scientific and technical fields that require a conceptualization of the world and the creation of unambiguous names for its components" [9]. This is mandatory for making knowledge accessible, especially when addressing a wide audience.

A fortiori, when aiming to build a science dictionary, the dissemination of scientific and technical knowledge requires explaining the significance of the names of concepts in a field, i.e. specialist terms. The outreach and didactic purpose places greater emphasis on the notional dimension, its accuracy, and its deciphering, in order to cover all the knowledge in a field. In essence, such dictionaries are aimed at a more restricted audience than encyclopedic dictionaries, even if the primary intention is to disseminate knowledge as widely as possible and reach everyone. Their construction involves, in our case, the participation of historians of science and especially of computing. The transmission of scientific and technical concepts to as many people as possible necessarily requires the integration of a meta-linguistic discourse and of textual and semantic strategies, to translate the accumulation of specialized terms into an understandable language.
In terminology, a descriptive strategy will focus on differentiating notions through linguistic links, to make the names of concepts that are opaque, or even unknown, to the uninitiated readable and appropriable. The aim here is to bring scientific concepts into the cultural mainstream [10]. This is because the materiality of language is just as much an obstacle to accessing specialized knowledge as the lack of mediation of scientific and technical objects. The linguistic forms (signifiers) of science can turn out to be just as unintelligible as the objects they name. Paraphrase and syntax are two tools that can avoid unwelcome heaviness. But in the transmission of knowledge, it is the concept that needs to be integrated rather than its name, even if Putnam [11] shows that we can talk about knowledge without having acquired it. Thus, it is not enough to acquire a lexical system; we need to acquire a notional system. However, a new word is apprehended by attaching it to, and differentiating it from, a network of units already held. But how do we go about constructing meaning when these prerequisite units do not yet exist?

There are two possibilities: the exploitation of semantic-syntactic relations and of lexical relations. Semantic-syntactic relations such as "typical object", "typical action" and "typical agent" (Lerat, 1987 and 1988), to which Gaudin (1995) adds "typical application", are useful for transmitting complex knowledge. They list the typical collocations of a lexical unit to give, through this combination, a precise idea of the use made of the named object. Here is an example of metadata from the inventory collections for the artifact « Tabulator BS 120 » from the ACONIT Conservatory (Grenoble, France):

• « La tabulatrice Bull BS 120 se compose d'un lecteur de cartes (traitant 150 cartes par minutes), d'une imprimante (cadence de 150 lignes par minute sur une largeur de 92 colonnes), d'une perforatrice de cartes (débitant soixante-quinze cartes par minute). Elle dispose d'un calculateur mécanique qui lui permet d'exécuter les quatre opérations arithmétiques, des opérations logiques et de mémoriser des informations. Elle offre un système de programmation amovible : le "tableau de connexion", qui était spécialement câblé pour chaque traitement. Sa technologie est entièrement électromécanique. »
• Typical action: lire, additionner et imprimer
• Typical object: caractères
• Typical agent: calculateur mécanique, lecteur de cartes, imprimante, perforatrice de cartes (ateliers mécanographiques à cartes perforées)
• Typical application: calcul électromécanique

Particularly well suited to popularizing science, semantic-syntactic relations lead the work of textual analysis toward the modeling of knowledge (a sketch of how such collocations can be surfaced from the metadata is given at the end of this passage). A fortiori, they make it possible to express the practices surrounding the object. Putting its application into context can only be conducive to understanding, even if this description is not sufficient for the acquisition of knowledge. The path towards the concept to be understood is thus supported by a description of the object's properties, i.e. a "making sense" and a delimitation, with these properties echoing the characteristics of the concept. But the scale of complexity of a specialist language for the layperson varies enormously from one field to another. In the example cited above, the semantic-syntactic relationships presented are not enough to provide access to meaning.
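As an illustration only, the following sketch shows how verb-object collocations of the kind listed above ("typical action" / "typical object") could be surfaced automatically from a French description field. It relies on spaCy and its fr_core_news_sm model, which are not tools named by the project (which uses TermoStat, Sketch Engine and NLTK); they serve here purely as a stand-in for the example.

```python
"""Illustrative sketch: surfacing rough 'typical action'/'typical object'
collocation candidates from a French description field via dependency parsing.

Assumption: spaCy with the fr_core_news_sm model is installed; this is a
stand-in illustration, not the project's actual pipeline.
"""
import spacy

nlp = spacy.load("fr_core_news_sm")

DESCRIPTION = (
    "La tabulatrice Bull BS 120 se compose d'un lecteur de cartes, "
    "d'une imprimante et d'une perforatrice de cartes. Elle dispose d'un "
    "calculateur mécanique qui lui permet d'exécuter les quatre opérations "
    "arithmétiques et de mémoriser des informations."
)

def verb_object_pairs(text: str):
    """Yield (verb lemma, object lemma) pairs as rough typical-action/object candidates."""
    doc = nlp(text)
    for token in doc:
        # A direct object attached to a verb approximates the 'typical object' of a 'typical action'.
        if token.dep_ == "obj" and token.head.pos_ == "VERB":
            yield token.head.lemma_, token.lemma_

if __name__ == "__main__":
    for verb, obj in verb_object_pairs(DESCRIPTION):
        print(f"typical action: {verb:12s} typical object: {obj}")
```

Such automatically proposed pairs would of course still require the manual validation described below before being retained as terminological information.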
Specialized terms need to be translated; otherwise there is a risk that too many obscure terms will be brought together and accumulate, creating a barrier for the uninitiated. For example, the typical application "electromechanical calculation" can be presented as a "precursor of computer programming", which provides the reader with a reference point. Thus, not all scientific languages pose equal difficulty in understanding their concepts, and they are not equally present in everyday language. Specialized language is not a monolithic whole. Many obstacles stand in the way of understanding by the layperson: complex terms (mechanographic workshops with punched cards), eponyms (Turing machine, Moore's law), word morphologies such as acronyms, complicated by version numbers (IBM 1130), special characters, etc. However, these signifiers can also serve as echoes of fuzzy knowledge, a visual and semantic anchor of recognition for the reader.

Contemporary science is characterized not only by its highly technical nature but also by its social roots. Technical objects are at the heart of our daily lives, all the more so for a science such as Information Technologies. Many terms are an integral part of our knowledge, such as keyboard or computer mouse, and it is through their typical application that meaning can be built up towards more complex notions and the names attached to these objects. Terminological complexity is thus found between the common language and the specialist language. But language uses can also differ within the same specialty: geographically (diatopy), over time according to technological developments (diachrony), and between communities of practice (diastraty). For example, a "first generation calculator" for the computer amateur (source: Wikipedia) will be referred to as a "mechanical calculator" by a computing history expert, or as an "ancient calculator" by a general history expert. For the initiated, one of the objects embodying this concept is the "abacus" ("abaque" in French), or "first digital tablet". For the general public of the 21st century, however, the concept has a completely different connotation, and the term is associated with a modern digital tablet (portable computer), whereas the object dates back to ancient times: the earliest counting devices were made up of clay balls and tokens, the "calculi" (Lat.), used for arithmetic as early as 7000 BC, which evolved into a device with rows of moving parts, better known from 500 BC under the name of "abacus" ("boulier" in French, which, unlike English, distinguishes the two types of objects in their diachronic evolution).

Differences in language, particularly among experts, stem from different representations of the world and different strategies: "... we only manipulate reality through the representations we have of it" [12]. In the field of Information Technologies, we can therefore find very divergent denominations: those of the general public, the amateur, the expert in the world of technology or heritage, etc. Museum inventories, for example, use highly controlled language, demonstrating, just as much as the accumulation of knowledge around documented artifacts, a mastery of the subject, setting up curators as credible guardians of an almost sacralized heritage.
The "Baby" of the amateur is called by museums the "Manchester Baby", after the place where it was built, or the "Manchester Small-Scale Experimental Machine" (London Science Museum), in reference to its history: the "Small-Scale Experimental Machine" (SSEM) was the world's first von Neumann architecture machine. Built at Manchester's Victoria University by Frederic Calland Williams, Tom Kilburn and Geoff Tootill in 1948, it is what the general public knows as the first "computer". According to an expert in the history of computing, mainstream "computers" are "stored-program computers, in the strict sense of the term, i.e. von Neumann-type computers, which have a very precise meaning: they are electronic calculating machines with a central memory large enough to hold the program being executed, as well as the data".

We are therefore faced with highly diversified linguistic uses, which can only lead to a blurring of concepts and names. On the one hand, there is everyday language; on the other, there are the specialist languages, which do not have the same status and are based on extralinguistic realities shared by sub-communities in the same field. In these languages, words can have a different linguistic weight, for example when they are themselves the name of a concept [13]. In this inclusive relationship between everyday language and its specialized subsets, words are the only tangible elements available to us to capture the representations to which they refer. We naturally find them in texts that can serve as a basis for terminology work. Linguistic analysis of corpora enables us to extract syntagms that can be linked together by lexical relations. Generic (genus-species) and partitive lexical relations are essential to the acquisition of new knowledge, and ISO 704 identifies them as the foundation of terminology work. Hyperonymy, hyponymy, meronymy, antonymy and isonymy effectively situate the new term in a notional context: hyperonymy in a genus/species relationship, hyponymy in class descriptions, antonymy, and especially isonymy, in notional differentiation. "To determine the meaning of a unit, the reader needs contrasts, and the canonical definition provides only one mode of category construction, that of specialization with the genus, defining it" [10]. Isonymy, as "any relationship linking two competing units, usually at the same level, without it being possible to establish a hierarchy that is valid from all points of view" [14], makes it possible to establish a fine-grained semantic relationship between closely related concepts, without any hierarchical link, situating them as parts of a whole. Concepts are thus described in a grouped manner. As we shall see, these lexical relations are highly represented in our corpus, which is drawn from science museum collections, according to museum mediation strategies that are pedagogical (educating the public), narrative (telling the story of a technology and its inventors or producers) or descriptive (describing an object according to its components or functions).

4. Information Technologies (IT) lexicons

The study of computer terminology is undoubtedly one of the fields that has most interested terminologists, who have delved into its morphological and semantic aspects (among others, see [15]), its uses in specialized discourse and its textual dimension, with particular reference to corpus linguistics (for an initial fundamental study, see Condamines, 2005), and issues related to interlingual comparison (among others, [16]).
Particularly interesting from our perspective is L'Homme (2008) [17], which compares terminological, ontological and general resources in describing computing terminology.1

1 L'Homme (2008) carried out a lexical extraction on the computing field: she extracted the 75 most frequent candidate terms from a corpus of specialized resources in the field of computer science using TermoStat, and showed that, in terms of formal coverage, the resource exhibiting the greatest presence of these terms, and therefore the most complete, was WordNet, a general-purpose lexical database, compared with domain-specific online dictionaries and ontologies; these latter resources even yielded differing results among themselves.

The ITinHeritage project is heritage-based and deeply interdisciplinary in nature, and the dialogue between disciplines presupposes an interrogation of fundamental notions such as term and concept from different points of view, namely from a knowledge-based approach and from a linguistic-based one. As L'Homme states, "Given the differences between the assumption of lexical-based and knowledge-based approaches and the principle on which they rely, the question is whether they can be used simultaneously in terminology work" [18]. In this perspective, through its heritage-based, interdisciplinary vision, the ITinHeritage project can be considered an attempt to answer this question. The main issue that interests us at this stage of the project is the identification of terminological units and lexical relations, necessary to structure domain knowledge; our analysis thus stands in continuity with many studies in the field of terminology that question the interface between terminology and knowledge-based approaches [19][20][21][22].

In a first phase, we postulate that the study of linguistic uses [23] in the IT heritage field will enable us to identify names of concepts and their significations, socially stabilized within the various communities that found this domain, by identifying the relationships mentioned above. This detailed analysis will ultimately enable us to construct a dialectic between experts, science and the public, and to observe the joint evolution of language and technology [24].

As the methodology in textual terminology is based on the triptych specialized corpus - experts - digital tools [25], our terminological work begins with a semasiological approach, through lexical extraction from museum metadata and linguistic analysis of our corpus. The corpus, composed of the metadata collected from the museums involved in the project, is built following the text selection criteria of ISO/DIS 5078:2023. It is precisely from these metadata that a first terminological extraction, also based on the regulatory criteria of ISO 5078, is currently being carried out. Our extraction approach, which will be applied to all languages included in the project, can be considered semi-automatic, comprising an initial automatic selection by extraction tools (TermoStat and Sketch Engine) and a subsequent manual intervention for lexical cleaning. We have relied on a predominantly hybrid approach, combining statistical techniques first and then linguistic ones. We will have the opportunity to analyze these techniques in detail with respective examples. Starting with the French corpus, whose data come from the ACONIT museum, the following fields were taken into consideration: description, name, model, and use. In this corpus, Description and Use are the only two fields composed of multiple textual sentences, while name and model can be seen as made up of a sort of controlled vocabulary.
For the first two fields, it was necessary to rely on a strategy of automatic terminological extraction. Various tools were tested and compared, including Sketch Engine [26] and TermoStat [27], and we also relied on the NLTK package of the Python programming language [28]; programming languages are now considered relevant alternatives to traditional extraction tools when facing corpora derived from big data [29]. However, after a first attempt, the initial extraction phase carried out with TermoStat as well as with the NLTK library produced a high level of noise, which would have required an excessively demanding manual intervention. Finally, Sketch Engine, a tool for textual management, analysis and extraction, proved to be the best option. The comparison corpus is the French Web Corpus 2023 (frTenTen23), composed of texts collected from the internet and totalling approximately 24 billion words in French; it is annotated with POS tags using the FreeLing tool. This comparison corpus belongs to the TenTen corpus family, which extends across 40 different languages (Jakubíček, Kilgarriff, Kovář, Rychlý and Suchomel, 2013). Thanks to the TenTen corpora, the same procedure applied to the French corpus can be used for other languages as well.

For the terminological extraction with the Sketch Engine keywords function, the selection filter was set to a maximum of 5,000 items, with particular attention paid to the MULTI-WORD TERMS results. The first SINGLE-WORD terms obtained, such as micro-ordinateur, disquette, macintosh, microprocesseur, azerty, hewlett-packard, powerbook, olivetti, alphanumérique, can in many cases be considered quite generic. We therefore decided to focus on multi-word terms, which seemed to be the most frequent terminological pattern in IT.

Regarding linguistic analysis of the extracted terms, the TermoStat software was initially preferred: simple and complex terms were extracted by selecting lexical categories such as nouns, adjectives and verbs. TermoStat uses a method of contrast between specialized and non-specialized corpora to identify terms. Those of greatest interest to us are terms with the highest specificity score, i.e. terms with a lower frequency in the general reference corpus but more representative of the specialized domain. TermoStat has some limitations: it is not possible, for example, to extract acronyms or terms beginning with an uppercase letter followed by a series of numbers. Extraction tools tend to overlook these patterns, while they represent a major terminological source in the IT industry. In most cases, these are names of machines and their models or manufacturers (e.g., Gamma 30, IBM 1130, …). In such cases, regular expressions and the Corpus Query Language (CQL) of the Concordance section of Sketch Engine can be used, allowing words, sentences and documents to be analysed in various possible contexts and more complex grammatical patterns to be extracted and analysed; a minimal sketch of this kind of pattern capture and frequency contrast is given at the end of this passage. Nevertheless, a first analysis provides some useful insights about the morphological patterns of candidate terms, as shown below.

Figure 2: ACONIT corpus - description field - 2998 candidate terms.
Figure 3: ACONIT corpus - use field - 5967 candidate terms.

Some preliminary observations can be made on the basis of this first linguistic analysis: the description and use fields present different characteristics as regards the identified candidate terms.
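The following sketch illustrates, outside any of the tools named above, the two ideas at work here: a regular expression that captures manufacturer-plus-model patterns of the kind extractors tend to miss (e.g. IBM 1130, Gamma 30), and a simple relative-frequency contrast against a reference corpus as a stand-in for the specificity scores computed by TermoStat. The reference-corpus counts and the scoring formula are illustrative assumptions, not the project's actual figures or statistics.

```python
"""Illustrative sketch: capturing model-name patterns with a regular expression
and ranking candidates by a simple frequency contrast ("specificity").

Assumptions: toy reference-corpus counts; a smoothed log-ratio of relative
frequencies as the contrast measure. TermoStat and Sketch Engine use their own
statistics; this is only a stand-in to make the idea concrete.
"""
import math
import re

# Uppercase token (acronym or brand) followed by a model number, e.g. "IBM 1130", "Gamma 30".
MODEL_PATTERN = re.compile(r"\b([A-Z][A-Za-z]*(?:-[A-Za-z0-9]+)?)\s+(\d{2,5})\b")

def extract_model_names(text: str) -> list[str]:
    """Return candidate machine/model names typically missed by POS-based extractors."""
    return [" ".join(match) for match in MODEL_PATTERN.findall(text)]

def specificity(freq_spec: int, size_spec: int, freq_ref: int, size_ref: int) -> float:
    """Log-ratio of relative frequencies (specialized vs. reference corpus), add-one smoothed."""
    rel_spec = (freq_spec + 1) / size_spec
    rel_ref = (freq_ref + 1) / size_ref
    return math.log(rel_spec / rel_ref)

if __name__ == "__main__":
    sample = "La tabulatrice Gamma 30 et l'ordinateur IBM 1130 figurent dans l'inventaire."
    print(extract_model_names(sample))  # ['Gamma 30', 'IBM 1130']

    # Specialized-corpus frequencies are those reported below for ACONIT; reference counts are toy values.
    corpus_size, reference_size = 200_000, 24_000_000_000
    candidates = {"clavier": (509, 1_200_000), "disquette": (296, 90_000), "machine": (443, 9_000_000)}
    ranked = sorted(candidates, key=lambda t: -specificity(candidates[t][0], corpus_size,
                                                           candidates[t][1], reference_size))
    for term in ranked:
        f_spec, f_ref = candidates[term]
        print(f"{term:12s} specificity={specificity(f_spec, corpus_size, f_ref, reference_size):.2f}")
```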
The description field focuses on machines and their components, as shown by the 10 most frequent candidate terms (classified on the basis of their specificity score) listed in Table 1.

Table 1: Frequency and Specificity of Regrouping Candidates

Group candidate       Frequency   Specificity
Keyboard              509         269.89
Floppy disk           296         231.41
Connector             258         227.80
Microcomputer         244         214.53
                      212         202.31
Button                242         177.87
Floppy disk drive     150         174.22
Key                   325         170.24
Printer               137         154.90
Machine               443         145.76

Verbs, which make up an interesting percentage of this sub-corpus, are very often markers of meronymic relations [30], as in the case of posséder (132 occurrences), contenir (101 occurrences), comprendre (90 occurrences) and comporter (30 occurrences). These markers often indicate references to the composition of the machines or to their technical functioning. Adjectives mostly refer to the shape or functions of the machines: for example, clavier, the most frequent noun among the candidates provided by TermoStat (frequency 509, specificity score 269.89), is associated with adjectives such as alphanumérique (153.97) and numérique (44.83).

Whereas the Description field involves descriptive museum mediation centered on the components of the object, the Use field involves a narrative mediation strategy. It reveals a more complex structure, where the terminological extraction focuses not only on artifacts but also on the practices and actors involved in IT history, as suggested by the presence of the verb utiliser and of the nouns utilisateur and utilisation. Verbs in this section more often relate specific actions or processes (permettre 471 occurrences, pouvoir 308 occurrences, servir 123 occurrences, agir 72 occurrences, fonctionner 60 occurrences), but we can also find verbal conceptual relation markers such as appartenir (9 occurrences), whose related terms highlight meronymic relationships: appartenir-lignée (73.17); appartenir-famille (56.93); appartenir-micro-ordinateur (48.81).

Moving to the English-language corpus, made up of metadata provided by the London Science Museum, the fields taken into consideration for lexical extraction are the same as for the French corpus, in order to obtain a reliable comparative study. We can now make an initial comparison between corpora in different languages. For example, the results of the Description field extraction show that this field focuses mostly on artifacts and concrete objects contextualized within a history or practice, and in this respect is similar to the mediation strategy observed in the Use field of the French corpus. Moving to another museum and another culture changes the relationship with the object and the representations of the world to which it belongs. Here, the morphological pattern Adj+Noun is more productive, with examples such as personal computer (110 occurrences), video game (84 occurrences), electronic calculator (53 occurrences) and electronic component (41 occurrences). The most frequent verbs focus on concrete actions: manufacture (195 occurrences), build (120 occurrences), work (101 occurrences), even if we can also find an important number of occurrences for verbs such as include (51 occurrences), which can be considered markers of meronymic relations.
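To make the use of such relation markers concrete, here is a minimal sketch, under illustrative assumptions, of how sentences containing marker verbs observed above (posséder, contenir, comprendre, comporter, include) could be flagged as Knowledge-Rich Contexts and mined for rough whole-part candidate pairs. The naive tokenization and the window-based pairing heuristic are assumptions for the example; they do not reproduce the MAR-REL approach cited above [30].

```python
"""Illustrative sketch: flagging Knowledge-Rich Contexts via meronymy marker
verbs and proposing rough whole-part candidate pairs.

Assumptions: a fixed marker list drawn from the corpus observations above;
naive tokenization and a simple "noun before marker / noun after marker"
pairing. This is a toy stand-in, not the MAR-REL method [30].
"""
import re

MERONYMY_MARKERS = {"possède", "posséder", "contient", "contenir", "comprend",
                    "comprendre", "comporte", "comporter", "include", "includes"}

def knowledge_rich_contexts(sentences: list[str]) -> list[tuple[str, str]]:
    """Return (marker, sentence) pairs for sentences containing a marker verb."""
    hits = []
    for sentence in sentences:
        tokens = set(re.findall(r"\w+[\w'-]*", sentence.lower()))
        for marker in MERONYMY_MARKERS & tokens:
            hits.append((marker, sentence))
    return hits

def candidate_part_whole(sentence: str, marker: str):
    """Rough heuristic: last word before the marker = whole, first longer word after = part."""
    words = sentence.lower().split()
    if marker not in words:
        return None
    i = words.index(marker)
    before = [w for w in words[:i] if w.isalpha()]
    after = [w for w in words[i + 1:] if w.isalpha() and len(w) > 3]
    return (before[-1], after[0]) if before and after else None

if __name__ == "__main__":
    corpus = [
        "La machine possède un clavier alphanumérique.",
        "L'unité centrale contient une mémoire à tores.",
        "Ce micro-ordinateur fut commercialisé en 1981.",
    ]
    for marker, sentence in knowledge_rich_contexts(corpus):
        print(marker, "->", candidate_part_whole(sentence, marker))
```

In the actual workflow, such candidate pairs would only ever be proposals to be validated by the domain experts and the terminologists.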
5. The need for more in-depth approaches

This semasiological approach results in lexicons that reflect the initial corpus and highlight, rather than resolve, the linguistic variety of a domain. It is not yet a consensual conceptualization. Yet if we are interested in the way experts name their domain, we are interested in the way they conceptualize it. However, our experience has shown us how difficult it is to draw up a semantic network solely on the basis of this semasiological work [31]. It poses the problem of finding in the texts only the variety of names of domain concepts, and not the concepts themselves. This type of study, although essential for laying the foundations of terminology work, stops at the meaning of words observed in discourse, defined according to their uses.

To compensate for linguistic variations and bring out a common meaning, it is advisable to take an interest in the extralinguistic part of terminology work, centered on the relationship of the concept to the object. The signification of a word is intended to be independent of its uses and is defined as consensually standardized within a community, referring to a conceptualization of the world; the meaning of a word is "signification actualized in discourse" [12]. Thus, "it should be remembered that all terminological work should be based on concepts and not on terms" (Felber, 1984). Terms are of interest to the terminologist because they denote a concept: a kind of bijective form, a bridge, reflecting the continuous dynamic between the linguistic part of the real world and the extralinguistic part of the symbolic one. The notional (or conceptual) world is the translation of how we apprehend objects in the real world. The standardization required by terminology work makes it possible to fix this consensual notional system within a community of practice.

In order to take into account the specificities and variations of the linguistic system, the notional system is more easily constructed using an onomasiological approach. In this formalization, based on convention in the Latin sense of foedus, the names of concepts must be situated within the notional system as a whole, even if this means artificializing them to provide a signification that goes beyond usage. Structuring concepts according to subsumption and difference relations enables us to build a standardized vocabulary that blurs the varieties and ambiguities of natural language (a minimal formal sketch is given below). The linguistic relations mentioned above and observed in our corpus, above all the generic and the mereological relation, enable a detailed description of objects in that they provide the notional environment (the specific features or constituents of an object). In this way, "ontological relations" are defined as "indirect relationships between the notions", the most important of which is the mereological (partitive) relationship [12]. So, objects are defined in sensible or intellectual reality according to the properties that the experience of empirical practice forms for them in the real world, to which the linguistic stratum corresponds. By abstraction, these notions (concepts) reflect and are organized according to these properties in the meta-linguistic stratum (the symbolic world), where they are translated into a set of characteristics derived from reason. The organization of concepts in relation to each other within the notional system therefore requires us to question the very essence of things, their eidetic characteristics.
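As a purely illustrative sketch of this structuring by subsumption and partitive relations, the following snippet declares a tiny class hierarchy and a part-whole property with owlready2. The class and property names are hypothetical and do not reproduce the project's CIDOC-CRM/EDM-based ontology; they only show how a genus, a species and its distinguishing restriction can be expressed formally.

```python
"""Illustrative sketch: subsumption (genus/difference) and a partitive relation
expressed in OWL via owlready2. Names are hypothetical, not the project's ontology.
"""
from owlready2 import Thing, ObjectProperty, get_ontology

# Hypothetical namespace; the project's real ontology is structured by CIDOC-CRM and EDM.
onto = get_ontology("http://example.org/itinheritage-sketch.owl")

with onto:
    class Calculator(Thing):                  # genus
        pass
    class MechanicalCalculator(Calculator):   # species: subsumption relation
        pass
    class Component(Thing):
        pass
    class ConnectionBoard(Component):
        pass
    class has_component(ObjectProperty):      # partitive (mereological) relation
        domain = [Calculator]
        range = [Component]

    # The "difference" expressed as a restriction: a mechanical calculator has some connection board.
    MechanicalCalculator.is_a.append(has_component.some(ConnectionBoard))

onto.save(file="itinheritage-sketch.owl", format="rdfxml")
```

The appended restriction plays the role of the "difference" that distinguishes the species from its genus, while has_component carries the partitive relation discussed above.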
At this point, the help of experts in the field is mandatory. They have a socio-linguistic responsibility in the transfer of knowledge because they are the only ones to master the discourse of the field, its conceptual representation, and its consensus [11]. Our work now is therefore to model this knowledge using an ontology that reflects the conceptualization of the IT world. The construction of the IT domain's onto-terminology is facilitated by the use of both the Protégé environment, for the structuring of the knowledge graph, and the TEDI AI tool [6], which is better suited than Protégé to terminology work according to the ISO 704 and 1087 standards. At the same time as building the dictionary, it allows us to define the ontology and analyze conceptualization phenomena. This novel modeling of IT heritage knowledge, based on the CIDOC-CRM upper ontology, will integrate the OWL-Time model [32] in order to be diachronic and to represent technological developments in the field. It will complement the first-level ontology through integration via CIDOC classes.

Our onto-terminology has to be evolutionary, because IT cannot be represented as fixed in time when it is constantly being created. For this purpose, we mobilize automatic language processing (ALP), clustering and machine learning to create an automated system for extending the terminology and ontology initially created by linguists. The work is based on datasets recognized by experts. New classes can be created by working on partitive relations, and we add a layer of description logic to formally specify the definition of classes using the Protégé environment. Our contributions here focus on the application of automated methods for completing manually constructed onto-terminologies, to keep pace with the evolution of the domain and the naming of new objects and knowledge. In this way, our terminological work is supported by the Computer Science Ontology (CSO) Classifier [33], a tool originally developed to automatically categorize Computer Science research papers, based on their abstracts, into a comprehensive semantic network, in particular through the generic and partitive relationships on which we are founding our terminological work. We decided to adapt this advanced AI-driven tool to our specific research needs, employing it to analyze particular descriptive fields, first within our dataset and secondly on web datasets such as Wiktionary and DBpedia. It will help us detect neo-terms relating to new objects in the IT field and allow us to fit these neo-terms into our ontology. This adaptation involves a dive into the classifier's operational mechanics, necessitating modifications to accommodate our ontology, which diverges from the broad Computer Science focus of the CSO. A key advantage in this adaptation process is the close alignment between our research domain and the original CSO corpus, allowing us to reuse the existing word2vec embedding model [34], which was trained on 4.6 million English papers in the field of Computer Science and is well aligned with the corpus we are using. We remain open to exploring the potential benefits of retraining these embeddings in the future, to see if such refinements could enhance the classifier's accuracy and sensitivity to the nuances of our domain; a minimal sketch of this embedding-based attachment is given below.

Finally, because our work would be useless if it were inaccessible to the public, we are building a web portal to provide access to the Knowledge Graph.
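As an illustration of the embedding-based attachment of neo-terms described above (and before turning to the portal itself), here is a minimal sketch. The four-dimensional toy vectors and the concept names stand in for lookups in the pretrained CSO word2vec model [34] and for concepts of our onto-terminology; the similarity threshold is likewise an assumption for the example.

```python
"""Illustrative sketch: attaching a candidate neo-term to the closest existing
ontology concept by embedding similarity.

Assumptions: toy 4-dimensional vectors standing in for the CSO word2vec
embeddings [34]; in practice the vectors would come from the pretrained (or
retrained) model and the concept list from our onto-terminology.
"""
import numpy as np

# Toy embeddings: in reality these would be looked up in the word2vec model.
CONCEPT_VECTORS = {
    "microcomputer":  np.array([0.9, 0.1, 0.0, 0.2]),
    "storage_device": np.array([0.1, 0.8, 0.3, 0.0]),
    "input_device":   np.array([0.2, 0.1, 0.9, 0.1]),
}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def attach_neo_term(neo_term_vector: np.ndarray, threshold: float = 0.6):
    """Return (closest concept, similarity) if above the threshold, else None for expert review."""
    name, vector = max(CONCEPT_VECTORS.items(), key=lambda kv: cosine(neo_term_vector, kv[1]))
    score = cosine(neo_term_vector, vector)
    return (name, score) if score >= threshold else None

if __name__ == "__main__":
    # Hypothetical vector for a neo-term extracted from a newly ingested description.
    neo_vector = np.array([0.15, 0.75, 0.25, 0.05])
    print(attach_neo_term(neo_vector))  # expected to attach to 'storage_device'
```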
Other platforms federate museum collections [35][36][37] or safeguard software [38], but these projects enhance objects, whereas we aim to enhance knowledge. The platform, designed as a place for transferring knowledge, will offer: navigation via a time map to represent technological evolution across history; a graph based upon the domain ontology to provide entry points into the knowledge graph via concepts representing domain knowledge; knowledge completion via the multilingual dictionary; and natural language querying. Querying the Knowledge Graph (RDF) normally requires SPARQL (SPARQL Protocol and RDF Query Language) queries; translating natural language questions into SPARQL significantly lowers the barrier to entry for users unfamiliar with technical query languages, making the data within our Knowledge Graph accessible to a broader audience. So, here again, we must mobilize NLP, through question-answering (QA) technologies and AI/HCI tools. Sparnatural is a well-known open-source tool for querying by non-experts [39], which can be used to navigate our knowledge graph based on the domain ontology. Comparatively, the QAnswer tool [40] is end-user-oriented and automatically translates natural language queries into SPARQL; it is intuitive and also learns from user feedback. We are currently assessing the suitability of these two tools for our needs.

6. Conclusion

The Information Technologies heritage is defined by the intrinsic tension between "objects - languages - representations" [41]. Technologies, terminologies and conceptualizations in the field are constantly evolving and interrelated. Our work aims to highlight and analyze these developments. It is the starting point for a better understanding of this recent and proliferating field, and of its heritage and dissemination, which are of prime importance, as it bears witness to our contemporary era and its digital and societal transformation. In this sense, the ITinHeritage research project is conducting this onto-terminological work as an anchor point for a wider epistemological reflection to come on the question of IT, placing at the heart of this heritage question the history of a societal mutation and the emergence of new knowledge in terms of representations of the world, through the definition of this heritage and of the new places of knowledge and practices. For this purpose, our multidisciplinary team is committed to harnessing IT for its own benefit and, in its own image, combining approaches from the social sciences and humanities with those of the computer sciences. The new heritages and corpora formed by data are leading us towards new approaches to terminology work. In doing so, we hope to provide an epistemological reflection not only on the subject of our study but also on the tools and methods related to language (linguistics and computer science).

Acknowledgments

This work has been partially supported by MIAI @ Grenoble Alpes (ANR-19-P3IA-0003). We also thank The QA Company, which provides us with technical assistance as well as access to QAnswer (https://qanswer.ai).

References

[1] I. Kant, Critique de la raison pure, Aubier, Paris, 1997.
[2] C. Jacob, Rassembler la mémoire, volume 4, Diogène, 2001.
[3] Aristotle, L'Empiria, Librairie Philosophique J. Vrin, Paris, 2020.
[4] F. Gaudin, Socioterminologie : des problèmes sémantiques aux pratiques institutionnelles, Presses universitaires de Rouen et du Havre, 1993.
[5] J. C. Boulanger, Présentation : images et parcours de la socioterminologie, Meta 40 (1995) 194–205.
[6] C. Roche, M. Papadopoulou, Mind the Gap: Ontology Authoring for Humanists, in: Proceedings of JOWO: The Joint Ontology Workshops, Graz, Austria, 2019.
[7] B. Russell, The philosophy of logical atomism, The Monist 28-29 (1918-1919) 495–527 and 32–63. Reprinted in B. Russell, Logic and Knowledge, Allen and Unwin, London, 1956.
[8] H. Diels, Die Fragmente der Vorsokratiker griechisch und deutsch, Weidmann, Berlin, 1903.
[9] C. Roche, Le terme et le concept : fondements d'une ontoterminologie, in: Actes de la conférence TOTh 2007, 2007.
[10] F. Gaudin, Dire les sciences et décrire les sens : entre vulgarisation et lexicographie, le cas des dictionnaires de sciences, TTR : traduction, terminologie, rédaction 8 (1996) 11–27.
[11] H. Putnam, Raison, vérité et histoire, Éditions de Minuit, Paris, 1984.
[12] C. Roche, Terminologie et ontologie, Langages (2005) 48–62.
[13] S. Auroux, Avant-propos, PUF, Paris, 1990, pp. vii–xx.
[14] A. Assal, et al., Sémantique et terminologie : sens et contextes, Terminologie et traduction (1992) 411–421.
[15] V. Claveau, M. C. L'Homme, Apprentissage par analogie pour la structuration de terminologie : utilisation comparée de ressources endogènes et exogènes, in: Actes de la conférence Terminologie et intelligence artificielle (TIA-2005), 2005.
[16] J. Humbley, La traduction des métaphores dans les langues de spécialité : le cas des virus informatiques, Linx. Revue des linguistes de l'université Paris X Nanterre (2005) 49–62.
[17] M. C. L'Homme, Ressources lexicales, terminologiques et ontologiques : une analyse comparative dans le domaine de l'informatique, Revue française de linguistique appliquée (2008) 97–118.
[18] M. C. L'Homme, Terminology and lexical semantics, in: P. Faber, M. C. L'Homme (Eds.), Theoretical Perspectives on Terminology. Explaining terms, concepts and specialised languages, John Benjamins, Amsterdam/Philadelphia, 2022, pp. 237–259.
[19] M. C. L'Homme, Sélection de termes dans un dictionnaire d'informatique : comparaison de corpus et critères lexico-sémantiques, in: Actes Euralex 2004, 2004, pp. 583–593.
[20] M. C. L'Homme, Conception d'un dictionnaire fondamental de l'informatique et de l'internet : sélection des entrées, Le langage et l'homme 40 (2005) 137–154.
[21] V. Malaisé, Méthodologie linguistique et terminologique pour la structuration d'ontologies différentielles à partir de corpus textuels, Ph.D. thesis, Université Paris-Diderot - Paris VII, 2005.
[22] I. Meyer, Concept management for terminology. A knowledge engineering approach, in: P. Faber, M. C. L'Homme (Eds.), Theoretical Perspectives on Terminology. Explaining terms, concepts and specialised languages, John Benjamins, Amsterdam/Philadelphia, 2022, pp. 110–126.
[23] M. Rossi, Termes et métaphores, entre diffusion et orientation des savoirs, La linguistique 57 (2021).
[24] C. Djambian, M. Rossi, G. D'Ippolito, La médiation des objets aux savoirs scientifiques et techniques, in: Actes de la conférence TOTh, Chambéry, 2023.
[25] N. González Granado, P. Drouin, A. Picton, De l'analyse statistique à l'apprentissage automatique : le langage R au service de la terminologie, Éla. Études de linguistique appliquée 208 (2022) 447–467.
[26] A. Kilgarriff, et al., The Sketch Engine: Ten years on, Lexicography 1 (2014) 7–36.
[27] P. Drouin, Term extraction using non-technical corpora as a point of leverage, Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 9 (2003) 99–115.
[28] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Sebastopol, 2009.
[29] L. Anthony, Programming for corpus linguistics, in: M. Paquot, S. T. Gries (Eds.), A Practical Handbook of Corpus Linguistics, Springer, Cham, 2021, pp. 181–207.
[30] L. Lefeuvre, A. Condamines, MAR-REL : une base de marqueurs de relations conceptuelles pour la détection de Contextes Riches en Connaissances (MAR-REL: a conceptual relation markers database for Knowledge-Rich Contexts extraction), in: Actes de la 24e Conférence sur le Traitement Automatique des Langues Naturelles, Volume 2 - Articles courts, ATALA, Orléans, France, 2017, pp. 183–191.
[31] C. Djambian, Valorisation d'un patrimoine documentaire industriel et évolution vers un système de gestion des connaissances orienté métiers, Ph.D. thesis, Université Jean Moulin - Lyon III, 2010.
[32] F. Pan, J. R. Hobbs, Temporal aggregates in OWL-Time, in: FLAIRS, 2005, pp. 560–565.
[33] A. A. Salatino, et al., The Computer Science Ontology: A comprehensive automatically-generated taxonomy of research areas, Data Intelligence 2 (2020) 379–416.
[34] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013). Available at https://arxiv.org/abs/1301.3781.
[35] P. Szekely, et al., Connecting the Smithsonian American Art Museum to the Linked Data Cloud, in: The Semantic Web: Semantics and Big Data, ESWC 2013, Berlin, Germany, 2013.
[36] V. de Boer, et al., Amsterdam Museum linked open data, Semantic Web 4 (2013) 237–243.
[37] M. Doerr, et al., The Europeana Data Model (EDM), in: Actes de IFLA 76, Gothenburg, Sweden, 2010.
[38] R. Di Cosmo, Software Heritage: why and how we collect, preserve and share all the software source code, in: 2018 IEEE/ACM 40th International Conference on Software Engineering, 2018.
[39] F. Clavaud, T. Francart, Sparnatural, un éditeur graphique souple et intuitif pour explorer des graphes de connaissances, in: Colloque Humanistica 2022, 2022.
[40] D. Diefenbach, et al., Towards a question answering system over the semantic web, Semantic Web 11 (2020).
[41] L. Smith, Heritage and its Intangibility, in: A. Skounti, O. Tebbaa (Eds.), De l'immatérialité du patrimoine culturel, Bureau régional de l'UNESCO de Rabat, Marrakech, 2011, pp. 10–20.