<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain Adaptation with Linked Encyclopedic Data: A Case Study for Historical German</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thora Hagen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut für Deutsche Philologie, Julius-Maximilians-Universität Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>443</fpage>
      <lpage>461</lpage>
      <abstract>
        <p>This paper outlines a proposal for using knowledge graphs for historical German domain adaptation. From the EncycNet project, an encyclopedia-based knowledge graph built from an early 20th-century encyclopedia was borrowed to examine whether text-based domain adaptation using the source encyclopedia's text or graph-based adaptation produces the better domain-specific model. To evaluate the approach, a novel historical test dataset based on a second encyclopedia of the early 20th century was created. This dataset is categorized by knowledge type (factual, linguistic, lexical), with special attention paid to distinguishing simple from expert knowledge. The main finding is that, surprisingly, simple knowledge has the most potential for improvement, whereas expert knowledge lags behind. In this study, broad signals like simple definitions and word origin yielded the best results, while more specialized knowledge such as synonyms was not represented as effectively. A follow-up study using simple contemporary lexical knowledge was carried out to control for historicity and text genre; its results confirm that language models can still be enhanced by incorporating simple lexical knowledge using the proposed workflow.</p>
      </abstract>
      <kwd-group>
        <kwd>language models</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>encyclopedic knowledge</kwd>
        <kwd>semantics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Based on Ryan’s principle of minimal departure [
        <xref ref-type="bibr" rid="ref31">30</xref>
        ], our understanding of any text is highly
dependent on our previous knowledge of the world. Consequently, depending on the type of
text, for example a technical paper, we would also need to be experts in the same scientific field
to be able to follow the arguments made. Another example would be historical literature, where
certain cues in the text can only be understood with a solid grounding in the societies,
fashion, or politics (among other topics) of that exact time period. The same can be argued
for language models (LMs). When working with texts of a specific topic, type, genre, time
period, etc., the language model’s performance is also dependent on whether the training data
matches the domain of the task at hand. In the digital humanities, where, depending
on the research domain, large text corpora may not be as readily available as for
contemporary English, the domain representation within the language model may not be stable
enough. When employing a LM, researchers can either turn to a specialized pre-trained LM for
the domain if available (e.g., MacBERTh [
        <xref ref-type="bibr" rid="ref27">26</xref>
        ] for historical English), or they have to perform
domain adaptation of a general domain LM.
      </p>
      <p>This paper explores how an encyclopedia-based knowledge graph (KG) can be used to adapt
language models specifically for historical German, with a focus on injecting the knowledge
from that period. The goal is to demonstrate a simple workflow for researchers in the digital
humanities to infuse LMs with domain knowledge using a KG. Especially in the humanities,
there may be specialized resources available, for example dictionaries, thesauri, or lexicons,
which can be transformed into knowledge graphs (see for example the projects LiLa (https://lila-erc.eu/)
and PURA (https://pric.unive.it/projects/pura/home)). KGs provide another form of knowledge
representation aside from text, and they generally
offer a wider variety of adaptation methods than text does. In this paper, the focus lies on the
comparison of text and KG.</p>
      <p>Specifically, this paper is concerned with the following research questions:
• How does adding a KG based on one encyclopedia as training data of a LM compare
to simply adding that exact encyclopedia, i.e., is creating a KG worth it for creating a
knowledge-infused LM?
• What kind of knowledge shows the most improvement when injecting an encyclopedic
KG into a LM (factual, lexical, linguistic)?
• Is a historical encyclopedia suited for historical domain adaptation?
For the experiment, two German encyclopedias from the early 20th century were chosen – one
for training (Meyers Großes Konversations-Lexikon [<xref ref-type="bibr" rid="ref28">27</xref>], dated 1905, in the following referred
to as Meyers) and one for evaluation (Brockhaus’ Kleines Konversations-Lexikon [<xref ref-type="bibr" rid="ref5">4</xref>], dated 1911,
in the following referred to as Brockhaus). The former has been transformed into a semantic
knowledge graph by EncycNet (https://encycnet.github.io/; the RDF knowledge graph is available at
http://dx.doi.org/10.5281/zenodo.10219192). In a follow-up study, a comparison is also made between
injecting contemporary linked semantic data, namely WordNet, and encyclopedic KGs in terms
of improving lexical semantic relations in LMs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Knowledge Enhanced Pre-trained Language Models</title>
        <p>
          The idea of injecting language models with knowledge graphs belongs to the research area of
knowledge enhancement. Generally speaking, not every form of knowledge can be learned
by feeding vast amounts of continuous text to a transformer model. Missing information,
meaning explicit grounding in the real world [<xref ref-type="bibr" rid="ref8">7</xref>], is not only apparent for domain knowledge
(expert knowledge about, e.g., drugs and diseases) but for common sense and factual knowledge
as well [<xref ref-type="bibr" rid="ref4">3</xref>]. As an example, OpenAI’s recent move to include images and other
media in model training (GPT-4o) also seeks to tackle the grounding problem.
        </p>
        <p>
          Knowledge enhanced pre-trained language models (KEPLMs) are language models that have
been tuned to better accommodate a specific area of knowledge. While algorithmic adaptation
is possible, many methods for creating KEPLMs rely on additional structured knowledge to
inject into the LMs. These can be, among others, additional text snippets describing concepts
or entities (e.g. dictionary definitions), tables, syntax trees, triples, rule systems, or knowledge
graphs [<xref ref-type="bibr" rid="ref17 ref42">15, 41</xref>]. Knowledge graphs bear an advantage over other structured data forms: they
may be reshaped into other data structures and are thus highly flexible regarding the choice of
method, and they can represent any type of human knowledge, meaning methods devised to
accommodate knowledge graphs can be adapted to any knowledge type.
        </p>
        <p>
          Five different categories of knowledge enhancement using KGs can be broadly distinguished
[<xref ref-type="bibr" rid="ref29">28</xref>]. The first category is concerned with adapting the masked language modeling (MLM)
training procedure (during pre-training or through continued training) using KG data. Firstly,
the information given in the KG can be used to employ strategic masking during training (e.g.,
to mask multi-word expressions [<xref ref-type="bibr" rid="ref36">35</xref>], assign masking probabilities for words through the graph
structure [43], or mask head and tail entities when appearing in the same text passage [3<xref ref-type="bibr" rid="ref2">2</xref>], etc.).
Secondly, the graph can be used to create new corpora through random walks [1<xref ref-type="bibr" rid="ref8">7</xref>], which can
be used for MLM the same way natural continuous text can. The second category deals with
employing additional tasks, either during pre-training or fine-tuning, which also use the KG as
training data. These tasks can be, for example, creating stable knowledge graph embeddings
[<xref ref-type="bibr" rid="ref41">40</xref>], or predicting the head, relation, or tail of triples from the KG [29]. The third category attends
to the input fusion of KG and text, either by merging text into the graph [3<xref ref-type="bibr" rid="ref5">4</xref>], the graph into text [<xref ref-type="bibr" rid="ref24">23</xref>], or
by merging features from the graph into the input layer of the transformer model [20]. These three
categories have in common that they all aim to change the parameters of the language model.
The final two categories of KEPLMs use KGs at inference (retrieval augmented generation) [<xref ref-type="bibr" rid="ref25">24</xref>],
or use the KG as evaluation data for interpretability and probing [3<xref ref-type="bibr" rid="ref7">6</xref>], where in both
cases the language model keeps its original parameter configuration.
        </p>
        <p>An additional trend for KEPLMs is the usage of adapters. First introduced by [39] as
K(nowledge)-adapters, adapters are a set of layers introduced into the transformer model, where,
during training, only the parameters of the adapters are changed while the rest of the LM stays
frozen. This is meant to minimize “forgetting”, where the original knowledge learned during
pre-training gets overwritten, and thus ensures that the injected knowledge stays independent.
In that way, multiple knowledge types can be injected into the model without interfering with
each other or the original model (e.g., as per [39], factual and linguistic adapters).</p>
        <p>
          In the following study, the focus lies on random walk generation as well as on using adapters for
training. Random walks have previously been employed for knowledge injection across a
multitude of knowledge domains and tasks: factual and common sense knowledge [1<xref ref-type="bibr" rid="ref8">7</xref>], eventuality
modeling [<xref ref-type="bibr" rid="ref43">42</xref>], entity classification and link prediction for the biomedical domain [<xref ref-type="bibr" rid="ref38">37</xref>], as well
as lexical, medical, and factual knowledge graph completion [2<xref ref-type="bibr" rid="ref1">1</xref>]. The method has also
previously been employed to create taxonomic word embeddings [1<xref ref-type="bibr" rid="ref7">6</xref>]. The intuition of the approach
lies in the assumption that traversing random walks in a graph can effectively capture its entire
topology and map its contents into latent space (node2vec algorithm [10]). The random walk
injection method was preferred here, as it allows for a fair comparison between the encyclopedia-enhanced
and knowledge-graph-enhanced language models. As the graph is deconstructed into
text form, both can be created with continued MLM training, and only the input representation
(continuous text vs. random walks) is different.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Domain Adaptation</title>
        <p>
          The field of KEPLMs shares significant overlap with the research area of domain adaptation. As
already briefly mentioned, domain adaptation is concerned with retroactively fitting general
pre-trained LMs to a domain-dependent task. Some of the approaches to creating KEPLMs are
quite similar, namely when structured knowledge is used to retroactively adapt a LM instead
of influencing pre-training or inference. In domain adaptation, similar methods include, for example,
continued MLM pre-training [<xref ref-type="bibr" rid="ref13">11</xref>] or employing different masking strategies [<xref ref-type="bibr" rid="ref2">2</xref>].
        </p>
        <p>While these fields share the aspect of subsequent model fitting, KEPLMs prioritize using
structured input data regardless of domain. Much work in this area focuses on improving
factual or common sense knowledge, not least because this is where most of the structured
resources are digitally available (most importantly Wikidata and ConceptNet). Domain
adaptation focuses more on solving the domain-specific task, regardless of the additional input used.
This paper seeks to make this connection explicit and set an example for the combination of the
two fields, namely using a KG to adapt a LM to the historical German knowledge domain.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Infusing Language Models with Historical Encyclopedic Knowledge</title>
      <sec id="sec-4-1">
        <title>3.1. Workflow Overview</title>
        <p>
          A schematic representation of the proposed workflow can be found in Figure 1. As [<xref ref-type="bibr" rid="ref22 ref38 ref43">17, 42, 37, 21</xref>] have demonstrated, random walks can be used to infuse LMs with new information, or, in
the case of [<xref ref-type="bibr" rid="ref18">16</xref>], even be the sole information source to build type embeddings. In this paper,
the method for random walk creation was borrowed and adapted from [1<xref ref-type="bibr" rid="ref8">7</xref>]. All triples were
extracted from Meyers’ knowledge graph, and the predicates were resolved to simple German.
As the original graph uses Wikidata properties, their German aliases were used for the
verbalization (e.g. “P5973” to “Synonym”). [<xref ref-type="bibr" rid="ref18">16</xref>] have shown that omitting verbalization worked
best in their case; however, as LMs process whole sentences, this simple verbalization method
was chosen here instead. The triples were parsed with networkX, and node2vec was used to
create the walks. The procedure was slightly adapted from [17]: more unspecific relations,
particularly “related to,” were assigned a lower edge weight to reduce their probability of being
selected during walks. Additionally, multiword expressions were not combined with
underscores in this case. In total, 752,230 random walks were created. Examples can be found in
Table 4 in the Appendix.
        </p>
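        <p>To make the triple-to-walk step more concrete, the following minimal Python sketch illustrates how verbalized triples can be parsed with networkX and turned into node2vec walks. The triples, the edge weight for “related to”-style predicates, and the walk hyperparameters shown are illustrative assumptions, not the exact project code.</p>
        <preformat>
# Minimal sketch: turn verbalized KG triples into node2vec random walks.
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Hypothetical triples whose Wikidata predicates were already resolved to
# simple German aliases (e.g. "P5973" -> "Synonym").
triples = [
    ("Wörterbuch", "verwandter Begriff", "Handwörterbuch"),
    ("Wörterbuch", "Synonym", "Lexikon"),
    ("Lexikon", "verwandter Begriff", "Enzyklopädie"),
]

G = nx.Graph()
for subj, pred, obj in triples:
    # Down-weight unspecific relations ("verwandter Begriff" / "related to")
    # so they are sampled less often during the walks.
    weight = 0.5 if pred == "verwandter Begriff" else 1.0
    G.add_edge(subj, obj, label=pred, weight=weight)

# node2vec precomputes the walks on construction; only the walks are used
# here, not the downstream embedding step.
n2v = Node2Vec(G, walk_length=8, num_walks=10, weight_key="weight", workers=1)

def verbalize(walk):
    """Re-insert the predicate between consecutive nodes of one walk."""
    return " ".join(f"{a} {G[a][b]['label']} {b}." for a, b in zip(walk, walk[1:]))

walk_corpus = [verbalize(w) for w in n2v.walks]
print(walk_corpus[0])
        </preformat>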
        <p>
          As a starting point, the current German state-of-the-art among encoder-based models,
gBERT-large,4 was used, and an adapter using the LoRA [<xref ref-type="bibr" rid="ref16">14</xref>] configuration was added. For
this training setup, this means that only 0.234% of the original parameter count had to be trained
(about 786K instead of about 335M parameters). Using the encyclopedia’s original text (see examples in
Table 5), one adapter was trained on the MLM task. Then, another adapter was trained
separately on the random walk KG representation of the same encyclopedia. Both adapters were
trained with the same hyperparameters: 8 epochs, an MLM probability of 0.15,
and a learning rate of 1e-4. Additionally, the model’s perplexity [3<xref ref-type="bibr" rid="ref1">1</xref>] during the random walk
training was calculated on a sample of the OSCAR dataset (used for the pre-training of gBERT)
over the course of 24 epochs (see Appendix A). Here, it can be seen that even though the use
of an adapter should mitigate forgetting pre-trained knowledge, the perplexity increases quite
steadily for OSCAR. However, it also declines for the random walks, confirming that the model
is improving on this dataset during training. This shows that there is still a trade-off, and
training with the random walk corpus should not be extended beyond a certain point, which is why
the training was stopped at epoch 8.
        </p>
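        <p>The adapter training step could look roughly as follows, a minimal sketch assuming the deepset/gbert-large checkpoint on Hugging Face and the PEFT library for the LoRA configuration; the LoRA rank, target modules, and batch size are assumptions, while the epochs, MLM probability, and learning rate follow the values reported above.</p>
        <preformat>
# Minimal sketch of the continued-MLM adapter training.
import torch
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "deepset/gbert-large"  # assumed gBERT-large checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Only the LoRA adapter weights are trainable; the base model stays frozen.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                                         target_modules=["query", "value"]))
model.print_trainable_parameters()  # prints the small trainable fraction

walks = ["Wörterbuch Synonym Lexikon. Lexikon verwandter Begriff Enzyklopädie."]
# In the experiment, `walks` would be the 752,230 random-walk sentences
# (or, for the other adapter, the encyclopedia's original text).

class WalkDataset(torch.utils.data.Dataset):
    """Wraps the pseudo-sentences as a tokenized torch dataset."""
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, max_length=128)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gbert-ency-kg-adapter",
                           num_train_epochs=8, learning_rate=1e-4,
                           per_device_train_batch_size=16),
    train_dataset=WalkDataset(walks),
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
        </preformat>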
        <p>The evaluation procedure relies on predicting the correct word from a given word plus word
relation using the fill-mask pipeline. The creation of these word pair datasets is described in
the following section. Using the [MASK] token and a verbalization of the expected relation,
the LM is prompted to predict the second word of the pair. Some examples can be found in
Table 6 in the Appendix. Then, the performance is calculated from the correct hits within the top
predictions of the LM. Other evaluation methods focus on embedding extraction of word types
by fusing the token embeddings from multiple sentences and measuring the relationship via
cosine distance. As the embedding method could be sensitive to sentence sampling, and could
potentially conflate the different dimensions of word “closeness” through cosine distance alone,
the evaluation strategy used here avoids the randomness introduced by sampling and takes
the nuances of word relations into account.</p>
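        <p>A minimal sketch of this hits@n probing follows; the word pair and the relation template are illustrative, and the adapter-enhanced model would be loaded and probed in exactly the same way as the base checkpoint shown here.</p>
        <preformat>
# Minimal sketch of hits@n evaluation via the fill-mask pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="deepset/gbert-large")

pairs = [("Maut", "Zoll")]  # (cue word, expected prediction); assumed example
template = "{} ist ein Synonym von {}."  # assumed verbalization

def hits_at_n(pairs, n=10):
    hits = 0
    for cue, target in pairs:
        prompt = template.format(cue, fill.tokenizer.mask_token)
        preds = [p["token_str"].strip() for p in fill(prompt, top_k=n)]
        hits += int(target in preds)
    return hits / len(pairs)

print(hits_at_n(pairs, n=10))
        </preformat>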
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Creation of the Evaluation Dataset</title>
        <p>
          The evaluation consists of probing the original and knowledge-infused LMs on different
knowledge types, which is meant to assess which information can actually be ingested with the
proposed workflow: factual, linguistic, and lexical semantic knowledge. Several different datasets
consisting of word pairs were constructed to cover these three tasks. The word pairs were
extracted with regular expressions from another German encyclopedia of the same time period
(Brockhaus) to make sure that the historical variation and text genre of the input encyclopedia
also match the evaluation data. The decision to use two different encyclopedias for
training and testing stems from the nature of the task, which is not about generalizing knowledge,
but rather about learning specific, encyclopedic relations such as synonyms and factual
associations. Unlike more general language tasks, the relations captured in an encyclopedia –
especially those pertaining to domain-specific knowledge – are inherently difficult to
generalize beyond their specific context. By testing on a second encyclopedia, Brockhaus, the aim is to
evaluate how well the model has internalized and can retrieve the learned relationships rather
than how well it generalizes abstract patterns; a similar approach to earlier “semantic retrofitting”
methods [<xref ref-type="bibr" rid="ref34">8, 33</xref>].
        </p>
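        <p>As an illustration of this extraction step, the following is a minimal, heavily simplified sketch; the entry format and the regular expression are invented stand-ins for the actual Brockhaus parsing rules.</p>
        <preformat>
# Minimal sketch of regex-based word pair extraction from lexicon entries.
import re

entries = [
    "Maut, s. Zoll.",                   # synonym-style cross-reference
    "Thrips, Gattung der Blasenfüßer.", # definition-style entry
]

# Invented pattern for "X, s. Y." ("siehe") cross-references.
synonym_pat = re.compile(r"^(?P&lt;w1&gt;[\wäöüÄÖÜß]+), s\. (?P&lt;w2&gt;[\wäöüÄÖÜß]+)\.")

pairs = [(m["w1"], m["w2"]) for e in entries if (m := synonym_pat.match(e))]
print(pairs)  # [('Maut', 'Zoll')]
        </preformat>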
        <p>For the evaluation data, 5 different types of word pair lists were constructed: people and their
year of birth,5 places and where they are located, words and their language of origin, pairs of
synonyms, and definitions of concepts (also referred to as the is-a relation or hypernymy). The first
two datasets represent factual knowledge, the third dataset represents linguistic knowledge,
and the last two represent lexical semantic knowledge.</p>
        <p>
          However, the content of encyclopedias in general is not only historical, but at times
extremely detailed, as they do not only cover general knowledge but a lot of domain-specific
knowledge as well, such as chemistry or botany, for instance. Similarly, some facts are
easier than others depending on how well known the entity in question is. Where possible, the
datasets were separated into two splits: simple and expert knowledge. For the dataset about
places, the population size was extracted along with the location. All places with a (historical)
population size exceeding 70,000 were added to the simple knowledge category. Places with a
population size between 30,000 and 70,000 were counted as expert knowledge. For both lexical
semantic datasets, GermaNet [<xref ref-type="bibr" rid="ref14">12</xref>] was used to gauge the level of specificity of the word pairs. In
more precise terms, the corresponding synset was retrieved for the second word of the pair,
along with its level in the hierarchy of GermaNet terms. From a psychological point of view,
a higher hypernym depth in the hierarchy would correspond to higher specificity / expert
knowledge, while a shallower depth would indicate a simpler kind of knowledge. When
given more than one synset for one word, the minimum depth of these synsets was chosen.
The extracted hypernym depths of synonyms exhibit a mean of 7.71 (median of 7), while the
is-a pairs have a mean of 6.52 (median of 6). The difference is to be expected, as the definitions
should always indicate an upper hierarchical level in contrast to the synonyms. When
comparing the encyclopedia synonyms to another dataset commonly used for evaluating word-level
similarity (the German translation of SimLex [<xref ref-type="bibr" rid="ref20">19</xref>]), the “expertness” of the encyclopedia becomes
apparent. The mean depth of SimLex word pairs is 5, meaning that on average, SimLex pairs
are 2 hierarchy levels above the encyclopedia pairs (see the distribution comparison in Figure 2).
        </p>
        <p>As a result, all word pairs where the predicted word has a hypernym depth of 6 or lower
were categorized as simple, and 7 and up counted as expert.6 For the linguistic and year of
birth datasets, the data were not split because no immediate additional feature for separation
could be identified. All created datasets can be found on GitHub.7
5The birth dates in the Brockhaus dataset exhibit a median of 1811 and are highly skewed, with a long tail extending
back to the year 1000. The 25th percentile (Q1) is 1757, and the 75th percentile (Q3) is 1835.
6While two separate thresholds could have been introduced here, a single “expertness” threshold ensures a reliable
comparison across both datasets. It is based on the assumption that a lexeme’s “expertness” level should not change
depending on whether it appears in a synonym or hypernym context.
7https://github.com/ThoraHagen/HistED/</p>
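        <p>The simple/expert split can be summarized in a few lines; the following sketch uses a hypothetical synset-depth lookup standing in for GermaNet (e.g., via the germanetpy library), and the example words and depths are invented for illustration.</p>
        <preformat>
# Minimal sketch of the simple/expert split by minimum hypernym depth.
def min_hypernym_depth(word, synset_depths):
    """Minimum hierarchy depth over all synsets of `word`, or None."""
    depths = synset_depths.get(word)
    return min(depths) if depths else None

def split_pairs(pairs, synset_depths, threshold=6):
    """The depth of the predicted (second) word decides the split: depths
    up to `threshold` count as simple, everything deeper as expert."""
    simple, expert = [], []
    for pair in pairs:
        depth = min_hypernym_depth(pair[1], synset_depths)
        if depth is None:
            continue  # no synset found; the pair is dropped
        (expert if depth > threshold else simple).append(pair)
    return simple, expert

# `depths` stands in for a GermaNet lookup; values are invented.
depths = {"Gebäude": [4], "Blasenfüßer": [9, 11]}
simple, expert = split_pairs(
    [("Haus", "Gebäude"), ("Thrips", "Blasenfüßer")], depths)
        </preformat>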
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Results</title>
        <p>The results of both the encyclopedia and the encyclopedia-KG adapted models can be found in Table
1. For evaluation, hits@n were calculated across all datasets except for year of birth, where the
average prediction error (distance from the true year) was used instead. The hits@n metric
reflects the proportion of correct answers ranked in the top n fill-mask predictions of the model, where
higher values indicate better performance. Overall, the KG-based domain adaptation is able to
outperform the full-text adaptation, though the degree of improvement varies across tasks.
Factual Knowledge Both evaluation datasets representing factual knowledge exhibit some
improvement, albeit minor. As indicated above, for the year of birth dataset, the evaluation
focused on the average difference between the actual year of birth and the top 3 predicted
years, because the hits@n metric for all three models yielded near-zero scores, even when
n was set to large values. This approach provides a better sense of how close the models’
predictions were to the correct year, given the low performance in ranking accuracy. It can
be seen that even though the model makes a somewhat better educated guess (as in “in an
encyclopedia published in 1905 there should not be any birthdates mentioned after that”), as
the average distance is reduced by about 50 years, the precision is still poor. A qualitative
review of the prompts did not find any correlation between correct guesses and a person’s
fame (as fame may reflect both simple and expert knowledge in this case). Further work on
quantifying fame and splitting the dataset accordingly is necessary to confirm this notion. A
similar sentiment can be observed with the location datasets. Both simple and expert location
knowledge exhibit minor improvements of about 1-4 percentage points (pp.) more hits@10.
One possible explanation could be that the majority of information about locations is already
contained in gBERT-large through its pre-training, and not many evaluation examples contain
information that has changed until today. The location dataset is quite fine-grained, meaning that
rather than countries, smaller regions are given as the true label, which also affects the exact
prediction accuracy. A qualitative examination of some evaluation instances shows that more
sensible location predictions were made overall, even if the exact label is not predicted (see
Appendix 6). However, similar to the birth year dataset, the accuracy is very low.
Linguistic Knowledge The task of assigning the origin language to a word represents
linguistic knowledge in this setup (for example Absolut and Latin). Because the outcome space
of the prediction is presumably much more limited than for the other datasets, the evaluation
setup was narrowed to hits@3. The observed improvements for both gBERT ency and gBERT
ency-KG are quite high, with 13 and 23 pp. more hits respectively. Out of all datasets, the
improvements are the highest here. However, it needs to be addressed that this dataset is
quite imbalanced, as most true labels are either French, Latin, or Greek, meaning that the
improvements seen could just be the result of a language distribution shift. Other words with
a different language of origin may not be predicted as well. In terms of a historical domain
adaptation, it still can be said that the method performs as intended: it is more likely that a
word in a German historical encyclopedia stems from one of these three languages, which is
exactly what the dataset reflects.</p>
        <p>Lexical Semantic Knowledge While some improvements can be observed for both the synonym
and is-a relations, the two datasets perform quite differently. Firstly, the is-a relations
outperform the synonyms, with about 7 pp. more hits for the former and merely 2 pp. more
hits for the latter concerning the simple relations. Secondly, both lexical expert variants
fall behind their simple counterparts, with only about a 4 pp. difference for definitions and
a 1 pp. difference for synonyms. Both results indicate that simpler lexical knowledge is
more beneficial to language models than expert lexical knowledge. One could assume that
the simpler knowledge would already be contained in gBERT through the OSCAR dataset
pre-training, and that the injection would benefit the representation of specialized knowledge
more, so this is a surprising result.</p>
        <p>
          In summary, it can be said that 1) factual knowledge shows a trend towards improvement but
lacks the specificity that these two datasets demand, 2) linguistic knowledge shows greater
improvements, though this result may stem from a simple distribution shift, and 3) lexical knowledge
shows greater improvements for the upper hierarchy level of is-a relations, while synonyms are
harder to predict. Across the datasets, but especially for lexical knowledge, simple knowledge
still bears more room for improvement, while expert knowledge is harder to ingest. This may
seem surprising, as previous studies have demonstrated that language models already possess,
or have largely mastered, basic semantic knowledge [<xref ref-type="bibr" rid="ref26 ref7">25, 6</xref>].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Lexical Semantic Knowledge for LM Infusion</title>
      <p>In this section of the paper, the focus therefore lies on confirming whether LMs still have room
for improvement regarding contemporary lexical semantic knowledge, by removing two confounding
factors from the previous experiment: original text type (encyclopedia) and historicity. This
is why, instead of the encyclopedia KG, WordNet is used for LM injection in the following
experiment.</p>
      <p>
        Concerning KEPLMs, many studies have been conducted that evaluate mostly factual
knowledge, task-based common sense, or domain knowledge, and lexically focused studies are
rather rare (see [<xref ref-type="bibr" rid="ref17">15</xref>] for an overview of recent KEPLM studies). To the best of the author’s
knowledge, there are no studies that explicitly evaluate the upper limit of lexical semantic knowledge
improvement for random-walk-fitted LMs. The most important similar studies focusing on lexically informed LMs
are LIBERT [18] and Mirror-BERT [<xref ref-type="bibr" rid="ref23">22</xref>]. LIBERT introduces a new
classification loss during pre-training, based on whether a given tuple holds a semantic relation, using
WordNet plus Roget’s Thesaurus. The authors evaluate on the GLUE benchmark, where the
focus lies on sentence-level semantics, as well as on lexical simplification, a variant of
assessing word-level similarity using context from sentences. Mirror-BERT does not rely on
external data but instead introduces text corruption, where the model learns to cluster true
and false (corrupted) text samples. Evaluation is based on sentence-level and word-level tasks,
including word-level similarity.
      </p>
      <p>Different from LIBERT and Mirror-BERT (aside from the injection method), this section also
takes different model sizes into account and evaluates on three different lexical tasks:
association, similarity, and entailment.</p>
      <sec id="sec-5-1">
        <title>4.1. Methodology</title>
        <p>A visualization of the WordNet workflow can be found in Figure 3. First, all triples were
extracted from the WordNet database. In a second step, the relations were verbalized to mimic
natural language, e.g. “synonym” to “is a synonym of.” Again, the verbalized triples were parsed
with networkX, and the node2vec algorithm was used to create 258,239 random walks. Some
examples of WordNet random walks can again be seen in Table 4 in the Appendix.</p>
        <p>
          To evaluate the retrofitting effectiveness for lexical semantics in particular, five datasets were
chosen as stand-ins for three different lexical semantic tasks: SimLex [<xref ref-type="bibr" rid="ref15">13</xref>] and SimVerb [9] for
evaluating semantic word similarity, WordSim [<xref ref-type="bibr" rid="ref1">1</xref>] and MEN [<xref ref-type="bibr" rid="ref6">5</xref>] for semantic word relatedness,
and HyperLex [<xref ref-type="bibr" rid="ref39">38</xref>] for evaluating lexical entailment. All datasets are score-annotated word
pairs; e.g., on a scale of 0 to 10, happy and cheerful score a similarity of 9.55 (SimLex).
        </p>
        <p>Because these datasets represent the strength of one semantic relationship between two
words rather than a binary relation, the evaluation method was slightly adapted in this
experiment. Similar to before, using the fill-mask strategy, the evaluation focuses on probing each
language model on relation prediction given the first word of each pair. However, in contrast to the
previous experiment, the top 100 words are predicted here. The inverse indices
of all word pair matches are compared to the true dataset scores using Spearman’s correlation.
For example, the RoBERTa-large model predicts cheerful from the task “happy is a synonym of
&lt;mask&gt;” at rank 5, which would translate to a similarity score of 95. In other words, scores are
assigned to word pairs by their prediction ranking of each model.</p>
        <p>
          Multiple models of different parameter sizes as well as pre-training text sizes are compared
to assess how the method scales with these model differences. Similar to the experiment
before, only LoRA adapters were trained instead of the whole model. In comparison to the
BERT family of encoder models, Llama-3,8 was also included as a point of reference for large
decoder-only language models. To match the evaluation strategy of predicting a single word, the
instruct variant (specialized in adapting to user-generated tasks) was chosen over the chat
variant (specialized in text generation). Here, an adapter was prompt-trained to predict the object
given a subject and predicate statement using WordNet triples, similar to the fill-mask task of
encoder models (for a similar approach see [3<xref ref-type="bibr" rid="ref7">6</xref>]). An overview of all models can be found
in Table 2. All WordNet adapters were trained three separate times to mitigate possible model
instability9 due to the random weight initialization, and the mean Spearman’s correlations
across these three adapters per model are reported.
8https://huggingface.co/meta-Llama/Meta-Llama-3-8B-Instruct. The model weights were cast to bfloat16 for
memory efficiency (low-precision training).
9The results exhibit a mean standard deviation of 0.01. Standard deviations were calculated per dataset and model.
        </p>
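        <p>For the decoder-only setup, the probing direction can be approximated as follows; this is a sketch under assumptions, as the exact prompt format and the prompt-trained adapter are not reproduced here, and only the untrained probing step is shown. The checkpoint name matches the footnote above.</p>
        <preformat>
# Minimal sketch of probing a decoder-only model for the object of a triple.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; requires HF access
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

# Assumed prompt wording; the subject and predicate come from the word pair.
prompt = "Complete the statement with one word. happy is a synonym of"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = lm(**inputs).logits[0, -1]

# The top-k next tokens serve as the model's ranked predictions.
top = torch.topk(next_token_logits, k=10)
print([tok.decode(t).strip() for t in top.indices])
        </preformat>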
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Results</title>
        <p>The results of the WordNet-adapted models can be found in Table 3. For all models, the
injection of WordNet is able to benefit the representation of lexical semantics using a
random walk adapter.</p>
        <p>Overall, a higher parameter count is beneficial for this approach, not only in terms of the
generally best performing models, but also in terms of the highest performance jumps. The two word relatedness
tasks, WordSim and MEN, do better on models trained with less text, while word similarity and
lexical entailment do better on the models trained with more text. An indicator for this kind
of separation could be the clarity of the evaluated relation: while semantic relatedness
indicates the degree of association between two words, semantic similarity indicates the degree of
synonymy and lexical entailment the degree of hypernymy. Compared to the latter two, the
former is a much fuzzier concept. This could indicate that with increasing parameter count,
more refined relations can be better represented instead of just word association. In the case
of hypernymy, which is not a symmetrical relation in contrast to the other two tasks,
RoBERTa-large shows the largest overall performance and the largest performance difference. The same trend
can be found in the non-fitted versions of the models. For RoBERTa-large, both similarity and
entailment are already represented significantly better in comparison to the mean of the other
models, while the performance on relatedness is comparable to the others. Concerning the
Llama model, even though it also shows signs of improvement, it cannot compare to the
encoder-based models in this setup. However, a trend similar to that of the large models emerges,
which is that on average, associations show the least improvement, followed by
similarity, and finally entailment benefits the most. The contrast to the other models may stem
from the differences in model pre-training and not necessarily from size differences only.
Further studies will be needed to explore how to better tailor the lexical adapter approach to
decoder-only models.</p>
        <p>
          The results indicate that more refined tasks, here lexical entailment, benefit more from
increased model size, while the less precise association task shows more stagnation across
different model sizes. In terms of the pre-training corpus size, the results are less intuitive. The
distilled variant of RoBERTa does not show any significant advantage over its BERT
counterpart. For the base variant, again, only synonyms and entailment show minor improvements
over BERT-base. When using WordNet random walks for creating a lexically informed LM,
it can be seen that models with more parameters benefit from the method for synonym and
entailment relations. Corpus size may only matter when both parameter count and text size are
comparatively high. For word association, the performance differences are generally not as
high, and the task shows a negative correlation with original corpus size. The assumption that
larger models already contain the majority of lexical knowledge and do not benefit from lexical
injections is therefore not true, and the results align with previous studies in this regard [1<xref ref-type="bibr" rid="ref23">8, 22</xref>].
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Summary and Outlook</title>
      <p>In summary, this paper has shown that extracting a KG from a resource can be helpful when
adapting a general LM into a historically informed LM. The models were evaluated with
a two-dimensional approach: one categorical dimension for the type of knowledge (factual,
linguistic, lexical semantics) and another, binary dimension to distinguish simple from expert
knowledge. The main finding is that, surprisingly, simple knowledge still bears the most
potential for improvement, while expert knowledge falls behind. The WordNet follow-up study
confirmed that language models can still be enhanced with simple lexical knowledge.</p>
      <p>Regarding the question of whether a historical encyclopedia is suited for historical domain
adaptation, it can be said that it depends on the use case of the language model.
Encyclopedias contain specialized knowledge: the diverse fields of expertise discussed, as
well as the historical perspective, directly influence their lexical richness, and thus encyclopedias
contain specialized knowledge also in terms of expert semantic relations. When using the
approach discussed here, one should target more precisely what kind of knowledge to inject.
Using the entire knowledge graph may not send strong enough signals for fitting a specific
task. Here, broad signals such as simple definitions and language of word origin showed the
best results, while synonyms especially could not be represented as effectively. Employing a
knowledge graph, future work could therefore explore multiple ways of limiting the training
data to either specific relations (e.g. to target synonyms only) or historical knowledge domains.
When controlling for these two confounding factors (expert domains discussed plus historical
expertise) using WordNet, it can be observed that the same method is capable of injecting
contemporary lexical knowledge such as synonymy into LMs, where even the larger models
generally perform better.</p>
      <p>In future work concerning model analysis, the test suite for the encyclopedic evaluation will
be further diversified. Currently, a binary classification of simple and expert knowledge,
determined through an automatic approach using GermaNet, is being used. However, the dataset
might exhibit a more intuitive notion of expert knowledge when manually annotating and
deriving a continuous score from the annotations. Additionally, more relations will be added to
the dataset to ensure that the results do not stem from peculiarities of the chosen relations and
better represent the overall task.</p>
      <p>There are more nuances to model training in this study that have not been taken into account
yet. For one, the hyperparameters have been kept stable for the entirety of the experiments
to ensure comparability between models. Potentially, this means that the upper bound of the
KG injection models has not been reached. Another question to pursue would be how this
method transfers to other tasks based on sentences. Instead of MLM adapters, the training
of task-based adapters, such as for NLI, is also possible. In future work, the evaluation could then
also focus on how stacking both the KG adapter and another task-trained adapter (with both
adapters activated during inference) could influence task performance. The hypothesis could
be that certain tasks that rely on lexical information, such as sentiment prediction or semantic
textual similarity, could also benefit from WordNet, for example. Finally, future work will also
aim to better understand the differences between encoder-based and decoder-only language
models. The disparities in pre-training (MLM vs. causal language modeling) may have
significant impacts on infusing these models with more knowledge. Therefore, different injection
strategies or prompting strategies will need to be compared to better assess the possibilities of
knowledge-enhanced pre-trained LLMs.</p>
      <p>A. Lauscher, O. Majewska, L. F. Ribeiro, I. Gurevych, N. Rozanov, and G. Glavaš.
“Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into
Pretrained Transformers”. In: Proceedings of Deep Learning Inside Out (DeeLIO): The First
Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 2020,
pp. 43–49.
[43] T. Zhang, C. Wang, N. Hu, M. Qiu, C. Tang, X. He, and J. Huang. “DKPLM: Decomposable
Knowledge-Enhanced Pre-trained Language Model for Natural Language Understanding”.
In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 2022, pp. 11703–11711.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Perplexity</title>
    </sec>
    <sec id="sec-8">
      <title>B. Random Walk Examples</title>
      <p>unmentionable is similar to impermissible. impermissible is similar to tabu. tabu is a
synonym of proscribed. proscribed is a synonym of forbidden. forbidden is similar to
impermissible. impermissible is similar to proscribed. proscribed is a synonym of
prohibited.
albuterol is a bronchodilator. bronchodilator is a medication. medication is a synonym of
medicinal_drug. medicinal_drug is a synonym of medication. medication is a synonym of
medicament. medicament is a synonym of medicinal_drug. medicinal_drug is a drug.
Kirrung verwandter Begriff Ankörnen. Ankörnen verwandter Begriff Blasenfüßer.
Blasenfüßer Hyperonym gelbbraune Dracänenblasenfuß. gelbbraune Dracänenblasenfuß
Hyponym Thrips. Thrips Definition Insektengruppe.</p>
      <p>Synonymenwörterbuch verwandter Begriff Wörterbuch. Wörterbuch verwandter Begriff
Handwörterbuch. Handwörterbuch verwandter Begriff Frerichs. Frerichs Synonym
Friedrich Theodor Frerichs. Friedrich Theodor Frerichs geboren 24. März 1819.</p>
    </sec>
    <sec id="sec-8b">
      <title>C. Encyclopedia Text Examples</title>
      <p>Blasenfüßer (Physopoda, Thysanoptera), Insektengruppe von sehr zweifelhafter Stellung im
System, wird zu den Falschnetzflüglern gestellt und umfaßt winzige Tierchen mit zylindrischem
Kopf, saugenden Mundwerkzeugen, sehr schmalen, stark befransten Flügeln, die bisweilen auch
fehlen, und runden Hastscheiden statt der Klauen an den Füßen. Die B. leben auf Blättern,
nehmen die zarte Oberhaut derselben weg und erzeugen dadurch oft bedeutenden Schaden. [...]
Wörterbuch (Lexikon), ein in rein alphabetischer oder alphabetisch-etymologischer Ordnung
verfaßtes Verzeichnis von Wörtern und Eigennamen (welch letztere aber bisweilen fehlen oder
ein besonderes W. bilden) mit oder ohne beigefügte Erklärung in der nämlichen oder in einer
andern Sprache. [...]</p>
    </sec>
    <sec id="sec-9">
      <title>D. Example Predictions from the Fill-Mask Pipeline</title>
      <p>Example fill-mask prompts (verbalizations) evaluated for gBERT and gBERT ency-KG:
(Leonian contract) is a [MASK].
(Maidstone) is located in [MASK].
(Toll) is a synonym of [MASK].
(William George Armstrong) was born in year [MASK].
(Accurate) is a word from the language [MASK].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravalova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          . “
          <article-title>A study on similarity and relatedness using distributional and WordNet-based approaches</article-title>
          ”. In:
          <source>Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2009</source>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aragon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P. L.</given-names>
            <surname>Monroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          . “
          <article-title>DisorBERT: A double domain adaptation model for detecting signs of mental disorders in social media”.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          .
          <year>2023</year>
          , pp.
          <fpage>15305</fpage>
          -
          <lpage>15318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cahyawijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lovenia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chung</surname>
          </string-name>
          , et al. “
          <article-title>A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity”</article-title>
          .
          <source>In: Proceedings of the 13th International Joint Conference</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Brockhaus</surname>
          </string-name>
          , ed.
          <source>Brockhaus' Kleines Konversations-Lexikon. 5th ed. Leipzig: Brockhaus</source>
          ,
          <year>1911</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          . “Multimodal Distributional Semantics”.
          <source>In: Journal of Artificial Intelligence Research</source>
          <volume>49</volume>
          (
          <year>2014</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Bergen</surname>
          </string-name>
          . “
          <article-title>Language model behavior: A comprehensive survey”</article-title>
          .
          <source>In: Computational Linguistics 50.1</source>
          (
          <issue>2024</issue>
          ), pp.
          <fpage>293</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Coelho Mollo</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Millière</surname>
          </string-name>
          . “
          <article-title>The vector grounding problem”</article-title>
          .
          <source>In: arXiv preprint arXiv:2304.01481</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Faruqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Jauhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          . “
          <article-title>Retrofitting word vectors to semantic lexicons</article-title>
          ”. In:
          <source>NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (2015)</source>
          , pp.
          <fpage>1606</fpage>
          -
          <lpage>1615</lpage>
          . doi: 10.3115/v1/n15-1184. arXiv: 1411.4166.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>arXiv: 1411</source>
          .
          <fpage>4166</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity</article-title>
          ”. In:
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <year>2016</year>
          , pp.
          <fpage>2173</fpage>
          -
          <lpage>2182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          . “node2vec:
          <article-title>Scalable feature learning for networks”</article-title>
          .
          <source>In:Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          .
          <source>2016</source>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          . “
          <article-title>Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks</article-title>
          ”. In:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <year>2020</year>
          , pp.
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamp</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Feldweg</surname>
          </string-name>
          . “
          <article-title>GermaNet - a lexical-semantic net for German”</article-title>
          .
          <source>In: Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications</source>
          .
          <year>1997</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation”</article-title>
          .
          <source>In: Computational Linguistics 41.4</source>
          (
          <year>2015</year>
          ), pp.
          <fpage>665</fpage>
          -
          <lpage>695</lpage>
          . doi: 10.1162/COLI_a_00237.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al. “
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models”</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          .
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          . “
          <article-title>A survey of knowledge enhanced pre-trained language models”</article-title>
          .
          <source>In: IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Klubička</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maldonado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahalunkar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          . “
          <article-title>English wordnet random walk pseudo-corpora”</article-title>
          .
          <source>In: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          .
          <year>2020</year>
          , pp.
          <fpage>4893</fpage>
          -
          <lpage>4902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Lauscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          . “
          <article-title>Specializing unsupervised pretraining models for word-level semantic similarity”</article-title>
          .
          <source>In: arXiv preprint arXiv:1909.02339</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I.</given-names>
            <surname>Leviant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          . “
          <article-title>Separated by an Un-common Language: Towards Judgment Language Informed Vector Space Modeling”</article-title>
          .
          <year>2015</year>
          . arXiv:1508.00106 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Padnos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sharir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shashua</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shoham</surname>
          </string-name>
          . “
          <article-title>SenseBERT: Driving Some Sense into BERT”</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . Ed. by
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>4656</fpage>
          -
          <lpage>4667</lpage>
          . doi: 10.18653/v1/2020.acl-main.423.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          . “
          <article-title>Fusing topology contexts and logical rules in language models for knowledge graph completion”</article-title>
          .
          <source>In: Information Fusion</source>
          <volume>90</volume>
          (
          <year>2023</year>
          ), pp.
          <fpage>253</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          . “
          <article-title>Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders”</article-title>
          .
          <source>In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <year>2021</year>
          , pp.
          <fpage>1442</fpage>
          -
          <lpage>1459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          . “
          <article-title>K-BERT: Enabling language representation with knowledge graph”</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . Vol.
          <volume>34</volume>
          .
          <year>2020</year>
          , pp.
          <fpage>2901</fpage>
          -
          <lpage>2908</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Logan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          . “
          <article-title>Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling”</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . Ed. by
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Traum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          . Florence, Italy: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>5962</fpage>
          -
          <lpage>5971</lpage>
          . doi: 10.18653/v1/P19-1598.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahowald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ivanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Blank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Fedorenko</surname>
          </string-name>
          .
          “
          <article-title>Dissociating language and thought in large language models”</article-title>
          .
          <source>In: Trends in Cognitive Sciences</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fonteyn</surname>
          </string-name>
          . “
          <article-title>Adapting vs. pre-training language models for historical languages”</article-title>
          .
          <source>In: Journal of Data Mining &amp; Digital Humanities</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Meyer</surname>
          </string-name>
          , ed.
          <source>Meyers Großes Konversations-Lexikon. 6th ed. Leipzig: Bibliographisches Institut</source>
          ,
          <year>1905</year>
          -
          <year>1909</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          . “
          <article-title>Unifying large language models and knowledge graphs: A roadmap”</article-title>
          .
          <source>In: IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Takanobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          . “
          <article-title>ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning”</article-title>
          .
          <source>In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          . Ed. by
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>3350</fpage>
          -
          <lpage>3363</lpage>
          .
          doi: 10.18653/v1/2021.acl-long.260.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.-L.</given-names>
            <surname>Ryan</surname>
          </string-name>
          . “
          <article-title>Fiction, non-factuals, and the principle of minimal departure”</article-title>
          .
          <source>In: Poetics 9.4</source>
          (
          <year>1980</year>
          ), pp.
          <fpage>403</fpage>
          -
          <lpage>422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Brumbaugh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          . “
          <article-title>Language Models: A Guide for the Perplexed”</article-title>
          .
          <source>In: arXiv preprint arXiv:2311.17301</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trischler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          . “
          <article-title>Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . Ed. by
          <string-name>
            <given-names>B.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>8980</fpage>
          -
          <lpage>8994</lpage>
          . doi: 10.18653/v1/2020.emnlp-main.722.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lowry-Duda</surname>
          </string-name>
          . “
          <article-title>ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge”</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          .
          <year>2017</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          . “
          <article-title>CoLAKE: Contextualized Language and Knowledge Embedding”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Computational Linguistics</source>
          .
          <year>2020</year>
          , pp.
          <fpage>3660</fpage>
          -
          <lpage>3670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          . “
          <article-title>ERNIE: Enhanced representation through knowledge integration”</article-title>
          .
          <source>In: arXiv preprint arXiv:1904.09223</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          . “
          <article-title>Can Language Models be Biomedical Knowledge Bases?”</article-title>
          <source>In: 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021</source>
          . Association for Computational Linguistics (ACL),
          <year>2021</year>
          , pp.
          <fpage>4723</fpage>
          -
          <lpage>4734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          . “
          <article-title>WalkLM: A uniform language model fine-tuning framework for attributed graph embedding”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>HyperLex: A Large-Scale Evaluation of Graded Lexical Entailment”</article-title>
          .
          <source>In: Computational Linguistics 43.4</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>781</fpage>
          -
          <lpage>835</lpage>
          . doi: 10.1162/COLI_a_00301.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          . “
          <article-title>K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters”</article-title>
          .
          <source>In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</source>
          .
          <year>2021</year>
          , pp.
          <fpage>1405</fpage>
          -
          <lpage>1418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          . “
          <article-title>KEPLER: A unified model for knowledge embedding and pre-trained language representation”</article-title>
          .
          <source>In: Transactions of the Association for Computational Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          ), pp.
          <fpage>176</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          . “
          <article-title>A survey of knowledge enhanced pre-trained models”</article-title>
          .
          <source>In: Journal of the Association for Computing Machinery 37.4</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Ng</surname>
          </string-name>
          . “
          <article-title>CoCoLM: Complex Commonsense Enhanced Language Model with Discourse Relations”</article-title>
          .
          <source>In: Findings of the Association for Computational Linguistics: ACL 2022</source>
          .
          <year>2022</year>
          , pp.
          <fpage>1175</fpage>
          -
          <lpage>1187</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>