<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain Adaptation with Linked Encyclopedic Data: A Case Study for Historical German</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thora Hagen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut für Deutsche Philologie, Julius-Maximilians-Universität Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>443</fpage>
      <lpage>461</lpage>
      <abstract>
        <p>This paper outlines a proposal for using knowledge graphs for historical German domain adaptation. From the EncycNet project, an encyclopedia-based knowledge graph built from an early 20th-century encyclopedia was borrowed to examine whether text-based domain adaptation using the source encyclopedia's text or graph-based adaptation produces the better domain-specific model. To evaluate the approach, a novel historical test dataset based on a second encyclopedia of the early 20th century was created. This dataset is categorized by knowledge type (factual, linguistic, lexical), with special attention paid to distinguishing simple from expert knowledge. The main finding is that, surprisingly, simple knowledge has the most potential for improvement, whereas expert knowledge lags behind. In this study, broad signals like simple definitions and word origin yielded the best results, while more specialized knowledge such as synonyms was not represented as effectively. A follow-up study using simple contemporary lexical knowledge was carried out to control for historicity and text genre; its results confirm that language models can still be enhanced by incorporating simple lexical knowledge using the proposed workflow.</p>
      </abstract>
      <kwd-group>
        <kwd>language models</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>encyclopedic knowledge</kwd>
        <kwd>semantics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Based on Ryan’s principle of minimal departure [
        <xref ref-type="bibr" rid="ref31">30</xref>
        ], our understanding of any text is highly
dependent on our previous knowledge of the world. Consequently, depending on the type of
text, for example a technical paper, we would also need to be experts in the same scientific field
to be able to follow the arguments made. Another example would be historical literature, where
certain cues in the text can only be understood with a solid grounding in the societies,
fashion, or politics (among other topics) of that exact time period. The same can be argued
for language models (LMs). When working with texts of a specific topic, type, genre, time
period, etc., the language model’s performance is also dependent on whether the training data
matches the domain of the task at hand. In the digital humanities, where, depending
on the research domain, large text corpora may not be as readily available as for
contemporary English, the domain representation within the language model may not be stable
enough. When employing a LM, researchers can either turn to a specialized pre-trained LM for
the domain if available (e.g., MacBERTh [
        <xref ref-type="bibr" rid="ref27">26</xref>
        ] for historical English), or they have to perform
domain adaptation of a general domain LM.
      </p>
      <p>This paper explores how an encyclopedia-based knowledge graph (KG) can be used to adapt
language models specifically for historical German, with a focus on injecting the knowledge
from that period. The goal is to demonstrate a simple workflow for researchers in the digital
humanities to infuse LMs with domain knowledge using a KG. Especially in the humanities,
there may be specialized resources available, for example dictionaries, thesauri, or lexicons,
which can be transformed into knowledge graphs (see for example the projects LiLa (https://lila-erc.eu/)
and PURA (https://pric.unive.it/projects/pura/home)). KGs provide another form of knowledge
representation aside from text, and they generally
offer a wider variety of adaptation methods than text does. In this paper, the focus lies on the
comparison of text and KG.</p>
      <p>Specifically, this paper is concerned with the following research questions:
• How does adding a KG based on one encyclopedia as training data of a LM compare
to simply adding that exact encyclopedia, i.e., is creating a KG worth it for creating a
knowledge-infused LM?
• What kind of knowledge shows the most improvement when injecting an encyclopedic
KG into a LM (factual, lexical, linguistic)?
• Is a historical encyclopedia suited for historical domain adaptation?
For the experiment, two German encyclopedias from the early 20th century were chosen – one
for training (Meyers Großes Konversations-Lexikon [<xref ref-type="bibr" rid="ref28">27</xref>], dated 1905, in the following referred
to as Meyers) and one for evaluation (Brockhaus’ Kleines Konversations-Lexikon [<xref ref-type="bibr" rid="ref5">4</xref>], dated 1911,
in the following referred to as Brockhaus). The former has been transformed into a semantic
knowledge graph by EncycNet (https://encycnet.github.io/; the RDF knowledge graph is available at
http://dx.doi.org/10.5281/zenodo.10219192). In a follow-up study, a comparison is also made between
injecting contemporary linked semantic data, namely WordNet, and encyclopedic KGs in terms
of improving lexical semantic relations in LMs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Knowledge Enhanced Pre-trained Language Models</title>
        <p>
          The idea of injecting language models with knowledge graphs belongs to the research area of
knowledge enhancement. Generally speaking, not every form of knowledge can be learned
by feeding vast amounts of continuous text to a transformer model. Missing information,
meaning explicit grounding in the real world [<xref ref-type="bibr" rid="ref8">7</xref>], is not only apparent for domain knowledge
(expert knowledge about, e.g., drugs and diseases) but for common sense and factual knowledge
as well [<xref ref-type="bibr" rid="ref4">3</xref>]. As an example, OpenAI’s recent move to include images and other
media in model training (GPT-4o) also seeks to tackle the grounding problem.
        </p>
        <p>
          Knowledge enhanced pre-trained language models (KEPLMs) are language models that have
been tuned to better accommodate a specific area of knowledge. While algorithmic adaptation
is possible, many methods for creating KEPLMs rely on additional structured knowledge to
inject into the LMs. These can be, among others, additional text snippets describing concepts
or entities (e.g. dictionary definitions), tables, syntax trees, triples, rule systems, or knowledge
graphs [<xref ref-type="bibr" rid="ref17 ref42">15, 41</xref>]. Knowledge graphs bear an advantage over other structured data forms: they
may be reshaped into other data structures and are thus highly flexible regarding the choice of
method, and they can represent any type of human knowledge, meaning methods devised to
accommodate knowledge graphs can be adapted to any knowledge type.
        </p>
        <p>
          Five different categories of knowledge enhancement using KGs can be broadly distinguished
[<xref ref-type="bibr" rid="ref29">28</xref>]. The first category is concerned with adapting the masked language modeling (MLM)
training procedure (during pre-training or through continued training) using KG data. Firstly,
the information given in the KG can be used to employ strategic masking during training (e.g.,
to mask multi-word expressions [<xref ref-type="bibr" rid="ref36">35</xref>], assign masking probabilities for words through the graph
structure [43], or mask head and tail entities when appearing in the same text passage [3<xref ref-type="bibr" rid="ref2">2</xref>], etc.).
Secondly, the graph can be used to create new corpora through random walks [1<xref ref-type="bibr" rid="ref8">7</xref>], which can
be used for MLM the same way natural continuous text can. The second category deals with
employing additional tasks, either during pre-training or fine-tuning, which also use the KG as
training data. These tasks can be, for example, creating stable knowledge graph embeddings
[<xref ref-type="bibr" rid="ref41">40</xref>], or predicting the head, relation, or tail of triples from the KG [29]. The third category attends
to the input fusion of KG and text, either by merging text into the graph [3<xref ref-type="bibr" rid="ref5">4</xref>], the graph into text [<xref ref-type="bibr" rid="ref24">23</xref>], or
by merging features from the graph into the input layer of the transformer model [20]. These three
categories have in common that they all aim to change the parameters of the language model.
The final two categories of KEPLMs use KGs at inference (retrieval augmented generation) [<xref ref-type="bibr" rid="ref25">24</xref>],
or use the KG as evaluation data for interpretability and probing [3<xref ref-type="bibr" rid="ref7">6</xref>], where in both
cases the language model keeps its original parameter configuration.
        </p>
        <p>An additional trend for KEPLMs is the usage of adapters. First introduced by [39] as
K(nowledge)-adapters, adapters are a set of layers introduced into the transformer model, where,
during training, only the parameters of the adapters are changed while the rest of the LM stays
frozen. This is meant to minimize “forgetting”, where the original knowledge learned during
pre-training gets overwritten, and thus ensures that the injected knowledge stays independent.
In that way, multiple knowledge types can be injected into the model without interfering with
each other or the original model (e.g., as per [39], factual and linguistic adapters).</p>
        <p>
          In the following study, the focus lies on random walk generation as well as on using adapters for
training. Random walks have previously been employed for knowledge injection across a
multitude of knowledge domains and tasks: factual and common sense knowledge [1<xref ref-type="bibr" rid="ref8">7</xref>], eventuality
modeling [<xref ref-type="bibr" rid="ref43">42</xref>], entity classification and link prediction for the biomedical domain [<xref ref-type="bibr" rid="ref38">37</xref>], as well
as lexical, medical, and factual knowledge graph completion [2<xref ref-type="bibr" rid="ref1">1</xref>]. The method has also
previously been employed to create taxonomic word embeddings [1<xref ref-type="bibr" rid="ref7">6</xref>]. The intuition of the approach
lies in the assumption that traversing random walks in a graph can effectively capture its entire
topology and map its contents into latent space (node2vec algorithm [10]). The random walk
injection method was preferred here, as it allows for a fair comparison between the encyclopedia-enhanced
and knowledge-graph-enhanced language models. As the graph is deconstructed into
text form, both can be created with continued MLM training, and only the input representation
(continuous text vs. random walks) is different.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Domain Adaptation</title>
        <p>
          The field of KEPLMs shares significant overlap with the research area of domain adaptation. As
already briefly mentioned, domain adaptation is concerned with retroactively fitting general
pre-trained LMs to a domain-dependent task. Some of the approaches to creating KEPLMs are
quite similar, namely when structured knowledge is used to retroactively adapt a LM instead
of influencing pre-training or inference. In domain adaptation, similar methods include, for example,
continued MLM pre-training [<xref ref-type="bibr" rid="ref13">11</xref>] or employing different masking strategies [<xref ref-type="bibr" rid="ref2">2</xref>].
        </p>
        <p>While these fields share the aspect of subsequent model fitting, KEPLMs prioritize using
structured input data regardless of domain. Much work in this area focuses on improving
factual or common sense knowledge, not least because this is where most of the structured
resources are digitally available (most importantly Wikidata and ConceptNet). Domain
adaptation focuses more on solving the domain-specific task, regardless of the additional input used.
This paper seeks to make this connection explicit and set an example for the combination of the
two fields, namely using a KG to adapt a LM to the historical German knowledge domain.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Infusing Language Models with Historical Encyclopedic Knowledge</title>
      <sec id="sec-4-1">
        <title>3.1. Workflow Overview</title>
        <p>
          A schematic representation of the proposed workflow can be found in Figure 1. As [<xref ref-type="bibr" rid="ref22 ref38 ref43">17, 42, 37, 21</xref>] have demonstrated, random walks can be used to infuse LMs with new information, or, in
the case of [<xref ref-type="bibr" rid="ref18">16</xref>], even be the sole information source to build type embeddings. In this paper,
the method for random walk creation was borrowed and adapted from [1<xref ref-type="bibr" rid="ref8">7</xref>]. All triples were
extracted from Meyers’ knowledge graph, and the predicates were resolved to simple German.
As the original graph uses Wikidata properties, their German aliases were used for the
verbalization (e.g. “P5973” to “Synonym”). [<xref ref-type="bibr" rid="ref18">16</xref>] have shown that omitting verbalization worked
best in their case; however, as LMs process whole sentences, this simple verbalization method
was chosen here instead. The triples were parsed with networkX, and node2vec was used to
create the walks. The procedure was slightly adapted from [17]: more unspecific relations,
particularly “related to,” were assigned a lower edge weight to reduce their probability of being
selected during walks. Additionally, multiword expressions were not combined with
underscores in this case. In total, 752,230 random walks were created. Examples can be found in
Table 4 in the Appendix.
        </p>
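        <p>To make the triple-to-walk step more concrete, the following minimal Python sketch illustrates how verbalized triples can be parsed with networkX and turned into node2vec walks. The triples, the edge weight for “related to”-style predicates, and the walk hyperparameters shown are illustrative assumptions, not the exact project code.</p>
        <preformat>
# Minimal sketch: turn verbalized KG triples into node2vec random walks.
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Hypothetical triples whose Wikidata predicates were already resolved to
# simple German aliases (e.g. "P5973" -> "Synonym").
triples = [
    ("Wörterbuch", "verwandter Begriff", "Handwörterbuch"),
    ("Wörterbuch", "Synonym", "Lexikon"),
    ("Lexikon", "verwandter Begriff", "Enzyklopädie"),
]

G = nx.Graph()
for subj, pred, obj in triples:
    # Down-weight unspecific relations ("verwandter Begriff" / "related to")
    # so they are sampled less often during the walks.
    weight = 0.5 if pred == "verwandter Begriff" else 1.0
    G.add_edge(subj, obj, label=pred, weight=weight)

# node2vec precomputes the walks on construction; only the walks are used
# here, not the downstream embedding step.
n2v = Node2Vec(G, walk_length=8, num_walks=10, weight_key="weight", workers=1)

def verbalize(walk):
    """Re-insert the predicate between consecutive nodes of one walk."""
    return " ".join(f"{a} {G[a][b]['label']} {b}." for a, b in zip(walk, walk[1:]))

walk_corpus = [verbalize(w) for w in n2v.walks]
print(walk_corpus[0])
        </preformat>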
        <p>
          As a starting point, the current German state-of-the-art among encoder-based models,
gBERT-large,4 was used, and an adapter using the LoRA [<xref ref-type="bibr" rid="ref16">14</xref>] configuration was added. For
this training setup, this means that only 0.234% of the original parameter count had to be trained
(about 786K instead of about 335M parameters). Using the encyclopedia’s original text (see examples in
Table 5), one adapter was trained on the MLM task. Then, another adapter was trained
separately on the random walk KG representation of the same encyclopedia. Both adapters were
trained with the same hyperparameters: 8 epochs, an MLM probability of 0.15,
and a learning rate of 1e-4. Additionally, the model’s perplexity [3<xref ref-type="bibr" rid="ref1">1</xref>] during the random walk
training was calculated on a sample of the OSCAR dataset (used for the pre-training of gBERT)
over the course of 24 epochs (see Appendix A). Here, it can be seen that even though the use
of an adapter should mitigate forgetting pre-trained knowledge, the perplexity increases quite
steadily for OSCAR. However, it also declines for the random walks, confirming that the model
is improving on this dataset during training. This shows that there is still a trade-off, and
training with the random walk corpus should not be extended beyond a certain point, which is why
the training was stopped at epoch 8.
        </p>
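        <p>The adapter training step could look roughly as follows, a minimal sketch assuming the deepset/gbert-large checkpoint on Hugging Face and the PEFT library for the LoRA configuration; the LoRA rank, target modules, and batch size are assumptions, while the epochs, MLM probability, and learning rate follow the values reported above.</p>
        <preformat>
# Minimal sketch of the continued-MLM adapter training.
import torch
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "deepset/gbert-large"  # assumed gBERT-large checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Only the LoRA adapter weights are trainable; the base model stays frozen.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                                         target_modules=["query", "value"]))
model.print_trainable_parameters()  # prints the small trainable fraction

walks = ["Wörterbuch Synonym Lexikon. Lexikon verwandter Begriff Enzyklopädie."]
# In the experiment, `walks` would be the 752,230 random-walk sentences
# (or, for the other adapter, the encyclopedia's original text).

class WalkDataset(torch.utils.data.Dataset):
    """Wraps the pseudo-sentences as a tokenized torch dataset."""
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, max_length=128)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gbert-ency-kg-adapter",
                           num_train_epochs=8, learning_rate=1e-4,
                           per_device_train_batch_size=16),
    train_dataset=WalkDataset(walks),
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
        </preformat>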
        <p>The evaluation procedure relies on predicting the correct word from a given word plus word
relation using the fill-mask pipeline. The creation of these word pair datasets is described in
the following section. Using the [MASK] token and a verbalization of the expected relation,
the LM is prompted to predict the second word of the pair. Some examples can be found in
Table 6 in the Appendix. Then, the performance is calculated from the correct hits within the top
predictions of the LM. Other evaluation methods focus on embedding extraction of word types
by fusing the token embeddings from multiple sentences and measuring the relationship via
cosine distance. As the embedding method could be sensitive to sentence sampling, and could
potentially conflate the different dimensions of word “closeness” through cosine distance alone,
the evaluation strategy used here avoids the randomness introduced by sampling and takes
the nuances of word relations into account.</p>
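        <p>A minimal sketch of this hits@n probing follows; the word pair and the relation template are illustrative, and the adapter-enhanced model would be loaded and probed in exactly the same way as the base checkpoint shown here.</p>
        <preformat>
# Minimal sketch of hits@n evaluation via the fill-mask pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="deepset/gbert-large")

pairs = [("Maut", "Zoll")]  # (cue word, expected prediction); assumed example
template = "{} ist ein Synonym von {}."  # assumed verbalization

def hits_at_n(pairs, n=10):
    hits = 0
    for cue, target in pairs:
        prompt = template.format(cue, fill.tokenizer.mask_token)
        preds = [p["token_str"].strip() for p in fill(prompt, top_k=n)]
        hits += int(target in preds)
    return hits / len(pairs)

print(hits_at_n(pairs, n=10))
        </preformat>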
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Creation of the Evaluation Dataset</title>
        <p>
          The evaluation consists of probing the original and knowledge-infused LMs on different
knowledge types, which is meant to assess which information can actually be ingested with the
proposed workflow: factual, linguistic, and lexical semantic knowledge. Several different datasets
consisting of word pairs were constructed to cover these three tasks. The word pairs were
extracted with regular expressions from another German encyclopedia of the same time period
(Brockhaus) to make sure that the historical variation and text genre of the input encyclopedia
also match the evaluation data. The decision to use two different encyclopedias for
training and testing stems from the nature of the task, which is not about generalizing knowledge,
but rather about learning specific, encyclopedic relations such as synonyms and factual
associations. Unlike more general language tasks, the relations captured in an encyclopedia –
especially those pertaining to domain-specific knowledge – are inherently difficult to
generalize beyond their specific context. By testing on a second encyclopedia, Brockhaus, the aim is to
evaluate how well the model has internalized and can retrieve the learned relationships rather
than how well it generalizes abstract patterns; a similar approach to earlier “semantic retrofitting”
methods [<xref ref-type="bibr" rid="ref34">8, 33</xref>].
        </p>
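        <p>As an illustration of this extraction step, the following is a minimal, heavily simplified sketch; the entry format and the regular expression are invented stand-ins for the actual Brockhaus parsing rules.</p>
        <preformat>
# Minimal sketch of regex-based word pair extraction from lexicon entries.
import re

entries = [
    "Maut, s. Zoll.",                   # synonym-style cross-reference
    "Thrips, Gattung der Blasenfüßer.", # definition-style entry
]

# Invented pattern for "X, s. Y." ("siehe") cross-references.
synonym_pat = re.compile(r"^(?P&lt;w1&gt;[\wäöüÄÖÜß]+), s\. (?P&lt;w2&gt;[\wäöüÄÖÜß]+)\.")

pairs = [(m["w1"], m["w2"]) for e in entries if (m := synonym_pat.match(e))]
print(pairs)  # [('Maut', 'Zoll')]
        </preformat>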
        <p>For the evaluation data, 5 different types of word pair lists were constructed: people and their
year of birth,5 places and where they are located, words and their language of origin, pairs of
synonyms, and definitions of concepts (also referred to as the is-a relation or hypernymy). The first
two datasets represent factual knowledge, the third dataset represents linguistic knowledge,
and the last two represent lexical semantic knowledge.</p>
        <p>
          However, the content of encyclopedias in general is not only historical, but at times
extremely detailed, as they do not only cover general knowledge but a lot of domain-specific
knowledge as well, such as chemistry or botany, for instance. Similarly, some facts are
easier than others depending on how well known the entity in question is. Where possible, the
datasets were separated into two splits: simple and expert knowledge. For the dataset about
places, the population size was extracted along with the location. All places with a (historical)
population size exceeding 70,000 were added to the simple knowledge category. Places with a
population size between 30,000 and 70,000 were counted as expert knowledge. For both lexical
semantic datasets, GermaNet [<xref ref-type="bibr" rid="ref14">12</xref>] was used to gauge the level of specificity of the word pairs. In
more precise terms, the corresponding synset was retrieved for the second word of the pair,
along with its level in the hierarchy of GermaNet terms. From a psychological point of view,
a higher hypernym depth in the hierarchy would correspond to higher specificity / expert
knowledge, while a shallower depth would indicate a simpler kind of knowledge. When
given more than one synset for one word, the minimum depth of these synsets was chosen.
The extracted hypernym depths of synonyms exhibit a mean of 7.71 (median of 7), while the
is-a pairs have a mean of 6.52 (median of 6). The difference is to be expected, as the definitions
should always indicate an upper hierarchical level in contrast to the synonyms. When
comparing the encyclopedia synonyms to another dataset commonly used for evaluating word-level
similarity (the German translation of SimLex [<xref ref-type="bibr" rid="ref20">19</xref>]), the “expertness” of the encyclopedia becomes
apparent. The mean depth of SimLex word pairs is 5, meaning that on average, SimLex pairs
are 2 hierarchy levels above the encyclopedia pairs (see the distribution comparison in Figure 2).
        </p>
        <p>As a result, all word pairs where the predicted word has a hypernym depth of 6 or lower
were categorized as simple, and 7 and up counted as expert.6 For the linguistic and year of
birth datasets, the data were not split because no immediate additional feature for separation
could be identified. All created datasets can be found on GitHub.7
5The birth dates in the Brockhaus dataset exhibit a median of 1811 and are highly skewed, with a long tail extending
back to the year 1000. The 25th percentile (Q1) is 1757, and the 75th percentile (Q3) is 1835.
6While two separate thresholds could have been introduced here, a single “expertness” threshold ensures a reliable
comparison across both datasets. It is based on the assumption that a lexeme’s “expertness” level should not change
depending on whether it appears in a synonym or hypernym context.
7https://github.com/ThoraHagen/HistED/</p>
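        <p>The simple/expert split can be summarized in a few lines; the following sketch uses a hypothetical synset-depth lookup standing in for GermaNet (e.g., via the germanetpy library), and the example words and depths are invented for illustration.</p>
        <preformat>
# Minimal sketch of the simple/expert split by minimum hypernym depth.
def min_hypernym_depth(word, synset_depths):
    """Minimum hierarchy depth over all synsets of `word`, or None."""
    depths = synset_depths.get(word)
    return min(depths) if depths else None

def split_pairs(pairs, synset_depths, threshold=6):
    """The depth of the predicted (second) word decides the split: depths
    up to `threshold` count as simple, everything deeper as expert."""
    simple, expert = [], []
    for pair in pairs:
        depth = min_hypernym_depth(pair[1], synset_depths)
        if depth is None:
            continue  # no synset found; the pair is dropped
        (expert if depth > threshold else simple).append(pair)
    return simple, expert

# `depths` stands in for a GermaNet lookup; values are invented.
depths = {"Gebäude": [4], "Blasenfüßer": [9, 11]}
simple, expert = split_pairs(
    [("Haus", "Gebäude"), ("Thrips", "Blasenfüßer")], depths)
        </preformat>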
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Results</title>
        <p>The results of both the encyclopedia and the encyclopedia-KG adapted models can be found in Table
1. For evaluation, hits@n were calculated across all datasets except for year of birth, where the
average prediction error (distance from the true year) was used instead. The hits@n metric
reflects the proportion of correct answers ranked in the top n fill-mask predictions of the model, where
higher values indicate better performance. Overall, the KG-based domain adaptation is able to
outperform the full-text adaptation, though the degree of improvement varies across tasks.
Factual Knowledge Both evaluation datasets representing factual knowledge exhibit some
improvement, albeit minor. As indicated above, for the year of birth dataset, the evaluation
focused on the average difference between the actual year of birth and the top 3 predicted
years, because the hits@n metric for all three models yielded near-zero scores, even when
n was set to large values. This approach provides a better sense of how close the models’
predictions were to the correct year, given the low performance in ranking accuracy. It can
be seen that even though the model makes a somewhat better educated guess (as in “in an
encyclopedia published in 1905 there should not be any birthdates mentioned after that”), as
the average distance is reduced by about 50 years, the precision is still poor. A qualitative
review of the prompts did not find any correlation between correct guesses and a person’s
fame (as fame may reflect both simple and expert knowledge in this case). Further work on
quantifying fame and splitting the dataset accordingly is necessary to confirm this notion. A
similar sentiment can be observed with the location datasets. Both simple and expert location
knowledge exhibit minor improvements of about 1-4 percentage points (pp.) more hits@10.
One possible explanation could be that the majority of information about locations is already
contained in gBERT-large through its pre-training, and not many evaluation examples contain
information that has changed until today. The location dataset is quite fine-grained, meaning that
rather than countries, smaller regions are given as the true label, which also affects the exact
prediction accuracy. A qualitative examination of some evaluation instances shows that more
sensible location predictions were made overall, even if the exact label is not predicted (see
Appendix 6). However, similar to the birth year dataset, the accuracy is very low.
Linguistic Knowledge The task of assigning the origin language to a word represents
linguistic knowledge in this setup (for example Absolut and Latin). Because the outcome space
of the prediction is presumably much more limited than for the other datasets, the evaluation
setup was narrowed to hits@3. The observed improvements for both gBERT ency and gBERT
ency-KG are quite high, with 13 and 23 pp. more hits respectively. Out of all datasets, the
improvements are the highest here. However, it needs to be addressed that this dataset is
quite imbalanced, as most true labels are either French, Latin, or Greek, meaning that the
improvements seen could just be the result of a language distribution shift. Other words with
a different language of origin may not be predicted as well. In terms of a historical domain
adaptation, it still can be said that the method performs as intended: it is more likely that a
word in a German historical encyclopedia stems from one of these three languages, which is
exactly what the dataset reflects.</p>
        <p>Lexical Semantic Knowledge While some improvements can be observed for both the synonym
and is-a relations, the two datasets perform quite differently. Firstly, the is-a relations
outperform the synonyms, with about 7 pp. more hits for the former and merely 2 pp. more
hits for the latter concerning the simple relations. Secondly, both lexical expert variants
fall behind their simple counterparts, with only about a 4 pp. difference for definitions and
a 1 pp. difference for synonyms. Both results indicate that simpler lexical knowledge is
more beneficial to language models than expert lexical knowledge. One could assume that
the simpler knowledge would already be contained in gBERT through the OSCAR dataset
pre-training, and that the injection would benefit the representation of specialized knowledge
more, so this is a surprising result.</p>
        <p>
          In summary, it can be said that 1) factual knowledge shows a trend towards improvement but
lacks the specificity that these two datasets demand, 2) linguistic knowledge shows greater
improvements, though this result may stem from a simple distribution shift, and 3) lexical knowledge
shows greater improvements for the upper hierarchy level of is-a relations, while synonyms are
harder to predict. Across the datasets, but especially for lexical knowledge, simple knowledge
still bears more room for improvement, while expert knowledge is harder to ingest. This may
seem surprising, as previous studies have demonstrated that language models already possess,
or have largely mastered, basic semantic knowledge [<xref ref-type="bibr" rid="ref26 ref7">25, 6</xref>].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Lexical Semantic Knowledge for LM Infusion</title>
      <p>In this section of the paper, the focus therefore lies on confirming whether LMs still have room
for improvement regarding contemporary lexical semantic knowledge, by removing two confounding
factors from the previous experiment: original text type (encyclopedia) and historicity. This
is why, instead of the encyclopedia KG, WordNet is used for LM injection in the following
experiment.</p>
      <p>
        Concerning KEPLMs, many studies have been conducted that evaluate mostly factual
knowledge, task-based common sense, or domain knowledge, and lexically focused studies are
rather rare (see [<xref ref-type="bibr" rid="ref17">15</xref>] for an overview of recent KEPLM studies). To the best of the author’s
knowledge, there are no studies that explicitly evaluate the upper limit of lexical semantic knowledge
improvement for random-walk-fitted LMs. The most important similar studies focusing on lexically informed LMs
are LIBERT [18] and Mirror-BERT [<xref ref-type="bibr" rid="ref23">22</xref>]. LIBERT introduces a new
classification loss during pre-training, based on whether a given tuple holds a semantic relation, using
WordNet plus Roget’s Thesaurus. The authors evaluate on the GLUE benchmark, where the
focus lies on sentence-level semantics, as well as on lexical simplification, a variant of
assessing word-level similarity using context from sentences. Mirror-BERT does not rely on
external data but instead introduces text corruption, where the model learns to cluster true
and false (corrupted) text samples. Evaluation is based on sentence-level and word-level tasks,
including word-level similarity.
      </p>
      <p>Different from LIBERT and Mirror-BERT (aside from the injection method), this section also
takes different model sizes into account and evaluates on three different lexical tasks:
association, similarity, and entailment.</p>
      <sec id="sec-5-1">
        <title>4.1. Methodology</title>
        <p>A visualization of the WordNet workflow can be found in Figure 3. First, all triples were
extracted from the WordNet database. In a second step, the relations were verbalized to mimic
natural language, e.g. “synonym” to “is a synonym of.” Again, the verbalized triples were parsed
with networkX, and the node2vec algorithm was used to create 258,239 random walks. Some
examples of WordNet random walks can again be seen in Table 4 in the Appendix.</p>
        <p>
          To evaluate the retrofitting effectiveness for lexical semantics in particular, five datasets were
chosen as stand-ins for three different lexical semantic tasks: SimLex [<xref ref-type="bibr" rid="ref15">13</xref>] and SimVerb [9] for
evaluating semantic word similarity, WordSim [<xref ref-type="bibr" rid="ref1">1</xref>] and MEN [<xref ref-type="bibr" rid="ref6">5</xref>] for semantic word relatedness,
and HyperLex [<xref ref-type="bibr" rid="ref39">38</xref>] for evaluating lexical entailment. All datasets are score-annotated word
pairs; e.g., on a scale of 0 to 10, happy and cheerful score a similarity of 9.55 (SimLex).
        </p>
        <p>Because these datasets represent the strength of one semantic relationship between two
words rather than a binary relation, the evaluation method was slightly adapted in this
experiment. Similar to before, using the fill-mask strategy, the evaluation focuses on probing each
language model on relation prediction given the first word of each pair. However, in contrast to the
previous experiment, the top 100 words are predicted here. The inverse indices
of all word pair matches are compared to the true dataset scores using Spearman’s correlation.
For example, the RoBERTa-large model predicts cheerful from the task “happy is a synonym of
&lt;mask&gt;” at rank 5, which would translate to a similarity score of 95. In other words, scores are
assigned to word pairs by their prediction ranking of each model.</p>
        <p>
          Multiple models of different parameter sizes as well as pre-training text sizes are compared
to assess how the method scales with these model differences. Similar to the experiment
before, only LoRA adapters were trained instead of the whole model. In comparison to the
BERT family of encoder models, Llama-3,8 was also included as a point of reference for large
decoder-only language models. To match the evaluation strategy of predicting a single word, the
instruct variant (specialized in adapting to user-generated tasks) was chosen over the chat
variant (specialized in text generation). Here, an adapter was prompt-trained to predict the object
given a subject and predicate statement using WordNet triples, similar to the fill-mask task of
encoder models (for a similar approach see [3<xref ref-type="bibr" rid="ref7">6</xref>]). An overview of all models can be found
in Table 2. All WordNet adapters were trained three separate times to mitigate possible model
instability9 due to the random weight initialization, and the mean Spearman’s correlations
across these three adapters per model are reported.
8https://huggingface.co/meta-Llama/Meta-Llama-3-8B-Instruct. The model weights were cast to bfloat16 for
memory efficiency (low-precision training).
9The results exhibit a mean standard deviation of 0.01. Standard deviations were calculated per dataset and model.
        </p>
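        <p>For the decoder-only setup, the probing direction can be approximated as follows; this is a sketch under assumptions, as the exact prompt format and the prompt-trained adapter are not reproduced here, and only the untrained probing step is shown. The checkpoint name matches the footnote above.</p>
        <preformat>
# Minimal sketch of probing a decoder-only model for the object of a triple.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; requires HF access
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

# Assumed prompt wording; the subject and predicate come from the word pair.
prompt = "Complete the statement with one word. happy is a synonym of"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = lm(**inputs).logits[0, -1]

# The top-k next tokens serve as the model's ranked predictions.
top = torch.topk(next_token_logits, k=10)
print([tok.decode(t).strip() for t in top.indices])
        </preformat>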
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Results</title>
        <p>The results of the WordNet-adapted models can be found in Table 3. For all models, the
injection of WordNet is able to benefit the representation of lexical semantics using a
random walk adapter.</p>
        <p>Overall, a higher parameter count is beneficial for this approach, not only in terms of the
generally best performing models, but also in terms of the highest performance jumps. The two word relatedness
tasks, WordSim and MEN, do better on models trained with less text, while word similarity and
lexical entailment do better on the models trained with more text. An indicator for this kind
of separation could be the clarity of the evaluated relation: while semantic relatedness
indicates the degree of association between two words, semantic similarity indicates the degree of
synonymy and lexical entailment the degree of hypernymy. Compared to the latter two, the
former is a much fuzzier concept. This could indicate that with increasing parameter count,
more refined relations can be better represented instead of just word association. In the case
of hypernymy, which is not a symmetrical relation in contrast to the other two tasks,
RoBERTa-large shows the largest overall performance and the largest performance difference. The same trend
can be found in the non-fitted versions of the models. For RoBERTa-large, both similarity and
entailment are already represented significantly better in comparison to the mean of the other
models, while the performance on relatedness is comparable to the others. Concerning the
Llama model, even though it also shows signs of improvement, it cannot compare to the
encoder-based models in this setup. However, a trend similar to that of the large models emerges,
which is that on average, associations show the least improvement, followed by
similarity, and finally entailment benefits the most. The contrast to the other models may stem
from the differences in model pre-training and not necessarily from size differences only.
Further studies will be needed to explore how to better tailor the lexical adapter approach to
decoder-only models.</p>
        <p>
          The results indicate that more refined tasks, here lexical entailment, benefit more from
increased model size, while the less precise association task shows more stagnation across
different model sizes. In terms of the pre-training corpus size, the results are less intuitive. The
distilled variant of RoBERTa does not show any significant advantage over its BERT
counterpart. For the base variant, again, only synonyms and entailment show minor improvements
over BERT-base. When using WordNet random walks for creating a lexically informed LM,
it can be seen that models with more parameters benefit from the method for synonym and
entailment relations. Corpus size may only matter when both parameter count and text size are
comparatively high. For word association, the performance differences are generally not as
high, and the task shows a negative correlation with original corpus size. The assumption that
larger models already contain the majority of lexical knowledge and do not benefit from lexical
injections is therefore not true, and the results align with previous studies in this regard [1<xref ref-type="bibr" rid="ref23">8, 22</xref>].
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Summary and Outlook</title>
      <p>In summary, this paper has shown that extracting a KG from a resource can be helpful when
adapting a general LM into a historically informed LM. The models were evaluated with
a two-dimensional approach: one categorical dimension for the type of knowledge (factual,
linguistic, lexical semantics) and another, binary dimension to distinguish simple from expert
knowledge. The main finding is that, surprisingly, simple knowledge still bears the most
potential for improvement, while expert knowledge falls behind. The WordNet follow-up study
confirmed that language models can still be enhanced with simple lexical knowledge.</p>
      <p>Regarding the question of whether a historical encyclopedia is suited for historical domain
adaptation, it can be said that it depends on the use case of the language model.
Encyclopedias contain specialized knowledge: the diverse fields of expertise discussed, as
well as the historical perspective, directly influence their lexical richness, and thus encyclopedias
contain specialized knowledge also in terms of expert semantic relations. When using the
approach discussed here, one should target more precisely what kind of knowledge to inject.
Using the entire knowledge graph may not send strong enough signals for fitting a specific
task. Here, broad signals such as simple definitions and language of word origin showed the
best results, while synonyms especially could not be represented as effectively. Employing a
knowledge graph, future work could therefore explore multiple ways of limiting the training
data to either specific relations (e.g. to target synonyms only) or historical knowledge domains.
When controlling for these two confounding factors (expert domains discussed plus historical
expertise) using WordNet, it can be observed that the same method is capable of injecting
contemporary lexical knowledge such as synonymy into LMs, where even the larger models
generally perform better.</p>
      <p>In future work concerning model analysis, the test suite for the encyclopedic evaluation will
be further diversified. Currently, a binary classification of simple and expert knowledge,
determined through an automatic approach using GermaNet, is being used. However, the dataset
might exhibit a more intuitive notion of expert knowledge when manually annotating and
deriving a continuous score from the annotations. Additionally, more relations will be added to
the dataset to ensure that the results do not stem from peculiarities of the chosen relations and
better represent the overall task.</p>
      <p>There are more nuances to model training in this study that have not been taken into account
yet. For one, the hyperparameters have been kept stable for the entirety of the experiments
to ensure comparability between models. Potentially, this means that the upper bound of the
KG injection models has not been reached. Another question to pursue would be how this
method transfers to other tasks based on sentences. Instead of MLM adapters, the training
of task-based adapters, such as for NLI, is also possible. In future work, the evaluation could then
also focus on how stacking both the KG adapter and another task-trained adapter (with both
adapters activated during inference) could influence task performance. The hypothesis could
be that certain tasks that rely on lexical information, such as sentiment prediction or semantic
textual similarity, could also benefit from WordNet, for example. Finally, future work will also
aim to better understand the differences between encoder-based and decoder-only language
models. The disparities in pre-training (MLM vs. causal language modeling) may have
significant impacts on infusing these models with more knowledge. Therefore, different injection
strategies or prompting strategies will need to be compared to better assess the possibilities of
knowledge-enhanced pre-trained LLMs.</p>
      <p>A. Lauscher, O. Majewska, L. F. Ribeiro, I. Gurevych, N. Rozanov, and G. Glavaš.
“Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into
Pretrained Transformers”. In: Proceedings of Deep Learning Inside Out (DeeLIO): The First
Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 2020,
pp. 43–49.
[43] T. Zhang, C. Wang, N. Hu, M. Qiu, C. Tang, X. He, and J. Huang. “DKPLM: Decomposable
Knowledge-Enhanced Pre-trained Language Model for Natural Language Understanding”.
In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 2022, pp. 11703–11711.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Perplexity</title>
    </sec>
    <sec id="sec-8">
      <title>B. Random Walk Examples</title>
      <p>unmentionable is similar to impermissible. impermissible is similar to tabu. tabu is a
synonym of proscribed. proscribed is a synonym of forbidden. forbidden is similar to
impermissible. impermissible is similar to proscribed. proscribed is a synonym of
prohibited.
albuterol is a bronchodilator. bronchodilator is a medication. medication is a synonym of
medicinal_drug. medicinal_drug is a synonym of medication. medication is a synonym of
medicament. medicament is a synonym of medicinal_drug. medicinal_drug is a drug.
Kirrung verwandter Begriff Ankörnen. Ankörnen verwandter Begriff Blasenfüßer.
Blasenfüßer Hyperonym gelbbraune Dracänenblasenfuß. gelbbraune Dracänenblasenfuß
Hyponym Thrips. Thrips Definition Insektengruppe.</p>
      <p>Synonymenwörterbuch verwandter Begriff Wörterbuch. Wörterbuch verwandter Begriff
Handwörterbuch. Handwörterbuch verwandter Begriff Frerichs. Frerichs Synonym
Friedrich Theodor Frerichs. Friedrich Theodor Frerichs geboren 24. März 1819.</p>
    </sec>
    <sec id="sec-8b">
      <title>C. Encyclopedia Text Examples</title>
      <p>Blasenfüßer (Physopoda, Thysanoptera), Insektengruppe von sehr zweifelhafter Stellung im
System, wird zu den Falschnetzflüglern gestellt und umfaßt winzige Tierchen mit zylindrischem
Kopf, saugenden Mundwerkzeugen, sehr schmalen, stark befransten Flügeln, die bisweilen auch
fehlen, und runden Hastscheiden statt der Klauen an den Füßen. Die B. leben auf Blättern,
nehmen die zarte Oberhaut derselben weg und erzeugen dadurch oft bedeutenden Schaden. [...]
Wörterbuch (Lexikon), ein in rein alphabetischer oder alphabetisch-etymologischer Ordnung
verfaßtes Verzeichnis von Wörtern und Eigennamen (welch letztere aber bisweilen fehlen oder
ein besonderes W. bilden) mit oder ohne beigefügte Erklärung in der nämlichen oder in einer
andern Sprache. [...]</p>
    </sec>
    <sec id="sec-9">
      <title>D. Example Predictions from the Fill-Mask Pipeline</title>
      <p>Example fill-mask prompts (verbalizations) evaluated for gBERT and gBERT ency-KG:
(Leonian contract) is a [MASK].
(Maidstone) is located in [MASK].
(Toll) is a synonym of [MASK].
(William George Armstrong) was born in year [MASK].
(Accurate) is a word from the language [MASK].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravalova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          . “
          <article-title>A study on similarity and relatedness using distributional and WordNet-based approaches</article-title>
          ”. In:
          <source>Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2009</source>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aragon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P. L.</given-names>
            <surname>Monroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          . “
          <article-title>DisorBERT: A double domain adaptation model for detecting signs of mental disorders in social media”.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          .
          <year>2023</year>
          , pp.
          <fpage>15305</fpage>
          -
          <lpage>15318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cahyawijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lovenia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chung</surname>
          </string-name>
          , et al. “
          <article-title>A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity”</article-title>
          .
          <source>In: Proceedings of the 13th International Joint Conference</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Brockhaus</surname>
          </string-name>
          , ed.
          <source>Brockhaus' Kleines Konversations-Lexikon. 5th ed. Leipzig: Brockhaus</source>
          ,
          <year>1911</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          . “Multimodal Distributional Semantics”.
          <source>In: Journal of Artificial Intelligence Research</source>
          <volume>49</volume>
          (
          <year>2014</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Bergen</surname>
          </string-name>
          . “
          <article-title>Language model behavior: A comprehensive survey”</article-title>
          .
          <source>In: Computational Linguistics 50.1</source>
          (
          <issue>2024</issue>
          ), pp.
          <fpage>293</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Coelho Mollo</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Millière</surname>
          </string-name>
          . “
          <article-title>The vector grounding problem”</article-title>
          .
          <source>In: arXiv preprint arXiv:2304.01481</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Faruqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Jauhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          . “
          <article-title>Retrofitting word vectors to semantic lexicons</article-title>
          ”. In:
          <source>NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (2015)</source>
          , pp.
          <fpage>1606</fpage>
          -
          <lpage>1615</lpage>
          . doi: 10.3115/v1/n15-1184. arXiv: 1411.4166.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>arXiv: 1411</source>
          .
          <fpage>4166</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity</article-title>
          ”. In:
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <year>2016</year>
          , pp.
          <fpage>2173</fpage>
          -
          <lpage>2182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          . “node2vec:
          <article-title>Scalable feature learning for networks”</article-title>
          .
          <source>In:Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          .
          <source>2016</source>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          . “
          <article-title>Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks</article-title>
          ”. In:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <year>2020</year>
          , pp.
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamp</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Feldweg</surname>
          </string-name>
          . “
          <article-title>GermaNet - a lexical-semantic net for German”</article-title>
          .
          <source>In: Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications</source>
          .
          <year>1997</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation”</article-title>
          .
          <source>In: Computational Linguistics 41.4</source>
          (
          <year>2015</year>
          ), pp.
          <fpage>665</fpage>
          -
          <lpage>695</lpage>
          . doi: 10.1162/COLI_a_00237.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al. “
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models”</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          .
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          . “
          <article-title>A survey of knowledge enhanced pre-trained language models”</article-title>
          .
          <source>In: IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Klubička</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maldonado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahalunkar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          . “
          <article-title>English wordnet random walk pseudo-corpora”</article-title>
          .
          <source>In: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          .
          <year>2020</year>
          , pp.
          <fpage>4893</fpage>
          -
          <lpage>4902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Lauscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          . “
          <article-title>Specializing unsupervised pretraining models for word-level semantic similarity”</article-title>
          .
          <source>In: arXiv preprint arXiv:1909.02339</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I.</given-names>
            <surname>Leviant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          . “
          <article-title>Separated by an Un-common Language: Towards Judgment Language Informed Vector Space Modeling”</article-title>
          .
          <year>2015</year>
          . arXiv:1508.00106 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Padnos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sharir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shashua</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shoham</surname>
          </string-name>
          . “
          <article-title>SenseBERT: Driving Some Sense into BERT”</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . Ed. by
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>4656</fpage>
          -
          <lpage>4667</lpage>
          . doi: 10.18653/v1/2020.acl-main.423.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          . “
          <article-title>Fusing topology contexts and logical rules in language models for knowledge graph completion”</article-title>
          .
          <source>In: Information Fusion</source>
          <volume>90</volume>
          (
          <year>2023</year>
          ), pp.
          <fpage>253</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          . “
          <article-title>Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders”</article-title>
          .
          <source>In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <year>2021</year>
          , pp.
          <fpage>1442</fpage>
          -
          <lpage>1459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          . “
          <article-title>K-BERT: Enabling language representation with knowledge graph”</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . Vol.
          <volume>34</volume>
          .
          <year>2020</year>
          , pp.
          <fpage>2901</fpage>
          -
          <lpage>2908</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Logan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          . “
          <article-title>Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling”</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . Ed. by
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Traum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          . Florence, Italy: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>5962</fpage>
          -
          <lpage>5971</lpage>
          . doi: 10.18653/v1/P19-1598.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahowald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ivanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Blank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Fedorenko</surname>
          </string-name>
          .
          “
          <article-title>Dissociating language and thought in large language models”</article-title>
          .
          <source>In: Trends in Cognitive Sciences</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fonteyn</surname>
          </string-name>
          . “
          <article-title>Adapting vs. pre-training language models for historical languages”</article-title>
          .
          <source>In: Journal of Data Mining &amp; Digital Humanities</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Meyer</surname>
          </string-name>
          , ed.
          <source>Meyers Großes Konversations-Lexikon. 6th ed. Leipzig: Bibliographisches Institut</source>
          ,
          <year>1905</year>
          -
          <year>1909</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          . “
          <article-title>Unifying large language models and knowledge graphs: A roadmap”</article-title>
          .
          <source>In: IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Takanobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          . “
          <article-title>ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning”</article-title>
          .
          <source>In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          . Ed. by
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>3350</fpage>
          -
          <lpage>3363</lpage>
          .
          doi: 10.18653/v1/2021.acl-long.260.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.-L.</given-names>
            <surname>Ryan</surname>
          </string-name>
          . “
          <article-title>Fiction, non-factuals, and the principle of minimal departure”</article-title>
          .
          <source>In: Poetics 9.4</source>
          (
          <year>1980</year>
          ), pp.
          <fpage>403</fpage>
          -
          <lpage>422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Brumbaugh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          . “
          <article-title>Language Models: A Guide for the Perplexed”</article-title>
          .
          <source>In: arXiv preprint arXiv:2311.17301</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trischler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          . “
          <article-title>Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . Ed. by
          <string-name>
            <given-names>B.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>8980</fpage>
          -
          <lpage>8994</lpage>
          . doi: 10.18653/v1/2020.emnlp-main.722.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lowry-Duda</surname>
          </string-name>
          . “
          <article-title>ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge”</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          .
          <year>2017</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          . “
          <article-title>CoLAKE: Contextualized Language and Knowledge Embedding”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Computational Linguistics</source>
          .
          <year>2020</year>
          , pp.
          <fpage>3660</fpage>
          -
          <lpage>3670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          . “
          <article-title>ERNIE: Enhanced representation through knowledge integration”</article-title>
          .
          <source>In: arXiv preprint arXiv:1904.09223</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          . “
          <article-title>Can Language Models be Biomedical Knowledge Bases?”</article-title>
          <source>In: 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021</source>
          . Association for Computational Linguistics (ACL),
          <year>2021</year>
          , pp.
          <fpage>4723</fpage>
          -
          <lpage>4734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          . “
          <article-title>WalkLM: A uniform language model fine-tuning framework for attributed graph embedding”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>HyperLex: A Large-Scale Evaluation of Graded Lexical Entailment”</article-title>
          .
          <source>In: Computational Linguistics 43.4</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>781</fpage>
          -
          <lpage>835</lpage>
          . doi: 10.1162/COLI_a_00301.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          . “
          <article-title>K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters”</article-title>
          .
          <source>In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</source>
          .
          <year>2021</year>
          , pp.
          <fpage>1405</fpage>
          -
          <lpage>1418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          . “
          <article-title>KEPLER: A unified model for knowledge embedding and pre-trained language representation”</article-title>
          .
          <source>In: Transactions of the Association for Computational Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          ), pp.
          <fpage>176</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          . “
          <article-title>A survey of knowledge enhanced pre-trained models”</article-title>
          .
          <source>In: Journal of the Association for Computing Machinery 37.4</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Ng</surname>
          </string-name>
          . “
          <article-title>CoCoLM: Complex Commonsense Enhanced Language Model with Discourse Relations”</article-title>
          .
          <source>In: Findings of the Association for Computational Linguistics: ACL 2022</source>
          .
          <year>2022</year>
          , pp.
          <fpage>1175</fpage>
          -
          <lpage>1187</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>