<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving Language Model Predictions via Prompts Enriched with Knowledge Graphs ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ryan</forename><surname>Brate</surname></persName>
							<email>r.brate@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">KNAW Humanities Cluster</orgName>
								<orgName type="department" key="dep2">Digital Humanities Lab</orgName>
								<address>
									<settlement>Amsterdam</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Minh-Hoang</forename><surname>Dang</surname></persName>
							<email>minhhoangdang@hotmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">Faculté des Sciences et Techniques (FST)</orgName>
								<orgName type="laboratory">LS2N</orgName>
								<orgName type="institution">Université de Nantes</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabian</forename><surname>Hoppe</surname></persName>
							<email>fabian.hoppe@kit.edu</email>
							<affiliation key="aff2">
								<orgName type="department">Leibniz Institute for Information Infrastructure</orgName>
								<orgName type="institution">FIZ Karlsruhe</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff3">
								<orgName type="institution" key="instit1">Karlsruhe Institute of Technology</orgName>
								<orgName type="institution" key="instit2">Institute AIFB</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yuan</forename><surname>He</surname></persName>
							<email>yuan.he@cs.ox.ac.uk</email>
							<affiliation key="aff4">
								<orgName type="institution">University of Oxford</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Albert</forename><surname>Meroño-Peñuela</surname></persName>
							<affiliation key="aff5">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vijay</forename><surname>Sadashivaiah</surname></persName>
							<affiliation key="aff6">
								<orgName type="institution">Rensselaer Polytechnic Institute</orgName>
								<address>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improving Language Model Predictions via Prompts Enriched with Knowledge Graphs ⋆</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D8E49FBEE0111F8189A6B070D419F3AB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:15+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Prompt Learning</term>
					<term>Pre-trained Language Model</term>
					<term>Knowledge Graph</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Despite advances in deep learning and knowledge graphs (KGs), using language models for natural language understanding and question answering remains a challenging task. Pre-trained language models (PLMs) have been shown to be able to leverage contextual information to complete cloze prompts and to perform next-sentence completion and question answering tasks in various domains. Unlike querying structured data in, e.g., KGs, mapping an input question to data that may or may not be stored by the language model is not a simple task. Recent studies have highlighted the improvements that can be made to the quality of information retrieved from PLMs by adding auxiliary data to otherwise naive prompts. In this paper, we explore the effects on language model performance of enriching prompts with additional contextual information leveraged from the Wikidata KG. Specifically, we compare the performance of naive vs. KG-engineered cloze prompts for entity genre classification in the movie domain. Selecting a broad range of commonly available Wikidata properties, we show that enriching cloze-style prompts with Wikidata information can result in significantly higher recall for the investigated BERT and RoBERTa large PLMs. However, it is also apparent that the optimal level of data enrichment differs between models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Pre-trained language models (PLMs) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>, based on deep learning attention-based architectures, have been shown to achieve outstanding performance at various natural language processing (NLP) tasks predicated on natural language understanding. However, the extent to which they capture domain knowledge and empirical semantics <ref type="bibr" target="#b2">[3]</ref>, i.e. the use of formal domain properties in practice, is not well understood. In this work, we narrow the focus to cloze-style completion, the task of predicting the masked entity text in a sentence. For example, given "The Klingons are a species in the franchise [MASK]", the PLM is expected to predict "Star Trek" for [MASK]. This task aims to extract the implicit knowledge entailed by PLMs, since such knowledge can be used for downstream NLP applications like sentiment analysis <ref type="bibr" target="#b3">[4]</ref>, dialogue systems <ref type="bibr" target="#b4">[5]</ref>, and natural language inference <ref type="bibr" target="#b5">[6]</ref>, as well as for completing the missing information of knowledge graphs (KGs) or ontologies <ref type="bibr" target="#b6">[7]</ref>, and even constructing new ones <ref type="bibr" target="#b7">[8]</ref>.</p><p>In recent years, PLMs have improved on the state of the art in many NLP tasks by leveraging large text corpora <ref type="bibr" target="#b8">[9]</ref>, but most of the time they still require annotated data for task-specific fine-tuning <ref type="bibr" target="#b9">[10]</ref>. Moreover, the empirical semantics gathered by these models is limited to distributional aspects <ref type="bibr" target="#b10">[11]</ref>. Therefore, performance, especially in the few- and zero-shot setting, depends heavily on the provided prompt, i.e. snippets of contextual information for a specific task. However, in many cases the engineering of the prompts is naive and simplistic, giving the PLM too little context to provide an accurate answer, and unsystematic, offering few principles on how exactly these prompts need to be composed in order to obtain predictable behaviour. Indeed, recent studies <ref type="bibr" target="#b11">[12]</ref> have highlighted the improvements that can be made to the quality of information retrieved from PLMs by amending these prompts. This casts doubt on studies <ref type="bibr" target="#b12">[13]</ref> claiming that a PLM cannot answer easy questions about, e.g., culture (movies, books, music, ...): it is reasonable to postulate that PLMs could answer those questions accurately if they were provided with systematically engineered prompts that contained richer contexts.</p><p>Existing approaches to prompt engineering include: (i) learn-by-example, where the prompt consists of a concatenation of correct examples of what we expect a PLM to predict <ref type="bibr" target="#b1">[2]</ref>; (ii) manually designed prompts of different granularities <ref type="bibr" target="#b12">[13]</ref>; (iii) automatically searched prompts optimized on few-shot samples <ref type="bibr" target="#b13">[14]</ref>; all of which rely on the implicit semantics of natural language texts. In this paper, we investigate how incorporating explicit knowledge from external sources like KGs can help prompt engineering and thus enhance the cloze-style question answering of PLMs. 
Specifically, we explore cloze-style prompts in the movie domain, evaluating the performance of the BERT and RoBERTa large PLMs.</p></div>
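<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the cloze-style querying concrete, the following is a minimal Python sketch using the HuggingFace transformers fill-mask pipeline; the model name and top_k value are illustrative choices, not prescribed by this paper.</p><p>
from transformers import pipeline

# BERT uses the literal token [MASK]; RoBERTa uses its own mask token.
unmasker = pipeline("fill-mask", model="bert-large-uncased")

predictions = unmasker("Die Hard is a movie, of the genre [MASK].", top_k=5)
for p in predictions:
    print(p["token_str"], p["score"])  # candidate genre token and its probability
</p></div>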
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Studies of prompt learning are based on the hypothesis that pre-trained language models (PLMs) have learnt abundant knowledge and just require sufficiently detailed contexts for predictions <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b14">15]</ref>; in this way, it is possible to apply PLMs without data-driven fine-tuning. A (hard<ref type="foot" target="#foot_0">1</ref> ) prompt is the conditioning text which is combined with the input to provide context or hints for the PLM. A template (i.e. pattern) is a function that integrates the inputs and prompts. Answers are then given by the PLM conditioned on the prompts, and a further function (i.e. verbalizer) is often required to map the answers to the final outputs. This is because the prompt learning paradigm is typically formulated as a task similar to the PLM's pre-training task, which does not necessarily yield the desired outputs of downstream applications.</p><p>An important part of prompt learning is prompt engineering, i.e., designing template(s), either manually or automatically, to support downstream applications. In <ref type="bibr" target="#b1">[2]</ref>, Brown et al. proposed to use demonstrations, i.e., a sequence of input-output texts, as the prompts, expecting that the PLM can implicitly learn to predict from examples. For instance, if we want the PLM to predict the masked position in "[MASK] is the capital of China.", we can demonstrate by appending "London is the capital of the UK" after the masked sentence. Schick et al. <ref type="bibr" target="#b15">[16]</ref> manually designed different templates, each corresponding to an individual PLM trained on few-shot examples. The predictions of downstream text classification and natural language inference tasks were then made according to an ensemble of the trained PLMs. Shin et al. <ref type="bibr" target="#b13">[14]</ref> argued that manually designed templates suffer from the uncertainty of guesswork or a lack of domain expertise; therefore, they proposed to search for templates using gradient-based optimization. More recently, Lu et al. <ref type="bibr" target="#b16">[17]</ref> have shown that PLM performance varies with the order of these prompts, and used generative language models and entropy statistics over prompt permutations to identify prompts with good performance.</p><p>KGs and ontologies are excellent sources of explicit knowledge for enriching prompts or verbalizers. West et al. <ref type="bibr" target="#b17">[18]</ref> considered distilling a student model in the commonsense domain from the enormously large PLM GPT-3 <ref type="bibr" target="#b1">[2]</ref>, which serves as the teacher model. They adopted the prompt learning scheme to extract triples from the teacher model, with templates created, and examples extracted, from the commonsense KG Atomic <ref type="bibr" target="#b18">[19]</ref>. Hu et al. <ref type="bibr" target="#b6">[7]</ref> argued that the label word space (i.e., the answer space) can be usefully expanded by incorporating external knowledge about related words. They employed different refinement heuristics to shortlist candidates to benefit the downstream classification task. For instance, if some "Person" is classified as a "Physicist" in the ground truth data, then answers like "Scientist" will also be accepted.</p><p>Our work was motivated by the probing study of Penha et al. 
<ref type="bibr" target="#b12">[13]</ref> that investigates whether BERT (a well-known PLM consisting of stacked transformer encoders <ref type="bibr" target="#b0">[1]</ref>) actually knows superficial cultural knowledge about books, movies, and music. Cloze-style questions for classifying the genre of entities (from Wikidata) of different books, movies, and music were given for the PLM to answer, often with unsatisfying performance. However, their work considered naive prompts without sufficient contexts, while ours attempts to examine if KGs can enrich these prompts, especially giving additional contexts (e.g., attributes, 𝑘-hop neighbours) of the entities in order to help the PLM to generate better predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>The basic idea of our method is to use the information about entities in KGs to expand cloze-style prompts with richer entity descriptions. It is summarized in Figure <ref type="figure" target="#fig_0">1</ref>. We enrich the naive prompt, for example "Die Hard is of genre [MASK]", by matching the movie Die Hard to the corresponding Wikidata item, extracting auxiliary knowledge with SPARQL queries, and generating an enriched prompt using this auxiliary data. We use datatype properties and verbalize entities using rdfs:label to compose valid phrases. As a result, we obtain an enriched prompt such as "Die Hard is a movie starring Bruce Willis, of the genre [MASK]". We then use both (a) the naive prompts and (b) the KG-enriched prompts to query various language models, and compare their performance on the entity genre classification task. In the following subsections, the enrichment by KG querying and the prompt engineering step are described in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Knowledge Graph Querying</head><p>The auxiliary data for each movie is extracted from Wikidata. This is done in a simple two-step process using SPARQL queries. The queries operate on a batch of input records to reduce the number of requests and avoid timeout errors.</p><p>First, the movies are linked to their respective Wikidata entities by IMDb or TMDb ID, utilizing the Wikidata properties IMDb ID (wdt:P345) and TMDb movie ID (wdt:P4947). If this does not yield an entity, exact string matching on the title is attempted as well.</p><p>SELECT ?mlId ?imdbId ?tmdbId ?movie WHERE { VALUES (?mlId ?imdbId ?tmdbId) {("1" "tt0114709" "862" ) ... } {?movie wdt:P345 ?imdbId . } UNION {?movie wdt:P4947 ?tmdbId .} } Listing 1: SPARQL query used for entity linking with the IMDb or TMDb ID.</p><p>The second step queries the entities for the auxiliary data used to enrich the prompts with additional contextual information. Overall, a set of 28 properties was extracted and investigated for each entity. A simplified version of the utilized SPARQL query is given in Listing 2; it can easily be adapted to other properties by adding them to the ?property values. From this set of properties, a subset of 10 manually selected domain-specific properties is used to construct the enriched prompts. The properties were selected based on human intuition and their frequency of co-occurrence for the given entities. Listing 2: Simplified SPARQL query used to retrieve additional movie knowledge from Wikidata.</p></div>
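<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a concrete illustration of this second step, the following Python sketch posts a property query of the kind described above to the public Wikidata SPARQL endpoint. It is a hedged reconstruction of the shape of Listing 2, not its verbatim text; the QID (assumed to come from the entity-linking step of Listing 1) and the property subset are illustrative.</p><p>
import requests

WDQS = "https://query.wikidata.org/sparql"

def fetch_properties(qid, pids):
    """Retrieve property values (with English labels where available) for one movie item."""
    props = " ".join(f"wdt:{p}" for p in pids)
    query = f"""
    SELECT ?property ?value ?valueLabel WHERE {{
      VALUES ?property {{ {props} }}
      wd:{qid} ?property ?value .
      OPTIONAL {{ ?value rdfs:label ?valueLabel .
                  FILTER(LANG(?valueLabel) = "en") }}
    }}"""
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "kg-prompt-enrichment-demo/0.1"})
    r.raise_for_status()
    return r.json()["results"]["bindings"]

# "Q105598" is assumed to be the Wikidata item for Die Hard, obtained via Listing 1.
rows = fetch_properties("Q105598", ["P57", "P161", "P577"])
</p></div>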
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Prompt Engineering</head><p>Similarly to <ref type="bibr" target="#b12">[13]</ref>, we consider an entity genre classification task. The prompts are of the form: "&lt;title&gt; is a movie &lt;Wikidata enrichment&gt;, of the genre [MASK].", where &lt;Wikidata enrichment&gt; is an aggregation of movie properties and corresponding values extracted from Wikidata pertaining to the title in question, in some natural language format. Table <ref type="table">1</ref> lists the Wikidata properties used to assemble values for &lt;Wikidata enrichment&gt;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Wikidata properties used in constructing probes for the movie dataset. 'Enrichment Text' is the text adopted in the probe enrichment to describe the property in question in a more natural language format.</p><p>The Wikidata properties listed in Table <ref type="table">1</ref> are broadly ranked in descending information specificity. In this order, ten variations of a probe were constructed by sequentially adding Wikidata properties to the prompt, building gradually more information-dense prompts. In adding property information, only the first value of each Wikidata property was used where more than one was available (e.g., the first listed cast member). For example, the non-enriched prompt pertaining to the movie Die Hard reads as follows; the successive enrichments are assembled analogously, as shown in the sketch below:</p><p>• non-enriched prompt: Die Hard is a movie, of the genre [MASK].</p></div>
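<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following Python sketch shows how such successively enriched prompts can be assembled from Table 1's enrichment texts, including the two verbalisation strategies evaluated in Section 4.2 (comma-separated vs. 'and'-separated). The property values shown for Die Hard are illustrative assumptions about what Wikidata returns first, not figures reported in the paper.</p><p>
# (enrichment text, first property value) pairs in descending information
# specificity; the values for Die Hard are illustrative.
ENRICHMENTS = [
    ("starring", "Bruce Willis"),
    ("directed by", "John McTiernan"),
    ("produced by", "Lawrence Gordon"),
    ("released", "1988"),
]

def build_prompt(title, k, strategy="A"):
    """Prompt enriched with the first k properties; A = comma-separated, B = 'and'-separated."""
    parts = [f"{text} {value}" for text, value in ENRICHMENTS[:k]]
    sep = ", " if strategy == "A" else " and "
    enrichment = " " + sep.join(parts) if parts else ""
    return f"{title} is a movie{enrichment}, of the genre [MASK]."

print(build_prompt("Die Hard", 0))  # non-enriched prompt
print(build_prompt("Die Hard", 2))  # first two successive enrichments
</p></div>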
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Dataset</head><p>In order to test our approach, we use the BERT <ref type="bibr" target="#b0">[1]</ref> and RoBERTa large <ref type="bibr" target="#b19">[20]</ref> pre-trained models.</p><p>The test dataset we use is a subset of the MovieLens 25M (ML25M) dataset <ref type="bibr" target="#b20">[21]</ref>. ML25M contains the titles and ground-truth genre classifications of 54,758 movies. A subset of this dataset was assembled, comprising those movies for which the Wikidata properties listed in Table <ref type="table">1</ref> were present in full. This resulted in a test set of 9,596 movie titles. The Wikidata properties, and thus the corresponding data subset, were selected as a compromise between a large dataset and a diverse set of domain-relevant Wikidata properties, following exploratory analysis of the ML25M dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Results</head><p>Table <ref type="table" target="#tab_3">2</ref> lists the recall@n scores for each of the prompts described in Section 3.2, for the BERT and RoBERTa large models respectively. For a given model and prompt, the recall@1 and recall@5 values for each movie are calculated as the fraction of the movie's ground-truth genres among the n highest ranked PLM mask predictions. The aggregated recall@n values reported in Table <ref type="table" target="#tab_3">2</ref> are the micro-averaged recall@n scores across all movies in the test dataset, with respect to the model and prompt referenced. As Table <ref type="table" target="#tab_3">2</ref> shows, certain variations of the enriched probes achieved greater R@n scores than the non-enriched case, for both the BERT and RoBERTa large models, across verbalisation strategies. We test the statistical significance of the R@n outcomes of the highest performing enriched prompts (bolded) against the non-enriched case via a one-tailed, dependent t-test, where the null hypothesis is that the average of the per-movie R@n differences is 0, and the alternative hypothesis is that this average is non-zero in favour of the selected enriched probe. A significance level of 0.05 is applied. Given the p-values in Table <ref type="table">3</ref>, we can affirm with statistical significance that the enriched prompts are more performant overall.</p></div>
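<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following is a minimal Python sketch of the evaluation just described: per-movie recall@n over the top-n mask predictions, aggregation over the test set, and the paired one-tailed t-test comparing an enriched prompt against the non-enriched one. The per-movie data is a toy illustration; scipy's ttest_rel with alternative="greater" (available from scipy 1.6) expresses the directional hypothesis.</p><p>
from scipy import stats

def recall_at_n(predicted, gold, n):
    """Fraction of a movie's ground-truth genres among the top-n mask predictions."""
    hits = set(predicted[:n]).intersection(gold)
    return len(hits) / len(gold)

# Toy per-movie data (three movies): top-5 predictions and ground-truth genres.
naive_preds    = [["drama", "comedy", "war", "crime", "horror"],
                  ["drama", "romance", "war", "comedy", "crime"],
                  ["comedy", "drama", "horror", "war", "crime"]]
enriched_preds = [["action", "thriller", "drama", "crime", "war"],
                  ["romance", "drama", "comedy", "war", "crime"],
                  ["horror", "thriller", "comedy", "drama", "war"]]
gold           = [["action", "thriller"], ["romance", "drama"], ["horror", "thriller"]]

naive_scores    = [recall_at_n(p, g, 5) for p, g in zip(naive_preds, gold)]
enriched_scores = [recall_at_n(p, g, 5) for p, g in zip(enriched_preds, gold)]

# Aggregate recall@5 averaged over all movies, as reported in Table 2.
print(sum(enriched_scores) / len(enriched_scores))

# Dependent one-tailed t-test.
# H0: mean per-movie difference is 0; H1: enriched recall exceeds non-enriched.
t_stat, p_value = stats.ttest_rel(enriched_scores, naive_scores, alternative="greater")
print(t_stat, p_value)
</p></div>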
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Discussion</head><p>The results and analysis of Section 4.2 support the position that, when considered en masse, enrichment of prompts with domain-relevant information from Wikidata can improve cloze-style genre prediction in the movie domain. This is the case for both of the investigated verbalisation strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Results of separate dependent one-tailed t-tests under the alternative hypothesis that the average difference between the enriched and non-enriched prompts is non-zero in favour of the enriched case. A p-value less than 0.05 means that we reject the null hypothesis in favour of the alternative, with a 5% chance of Type I error. Note: * denotes that the p-value is 0 to at least 3 significant figures.</p><p>It is noteworthy, however, that the BERT and RoBERTa large models behave very differently in terms of both their non-enriched performance and their performance under varying levels of enrichment. This demonstrates that the potential for PLM improvement via prompt enrichment is highly specific to the model in question. BERT achieves its best aggregate recall for enriched prompts with relatively low levels of information enrichment, followed by a very rapid reduction in recall@n for further enriched prompts. RoBERTa large, in contrast, shows fluctuating performance relative to the non-enriched prompt, with the greatest performance for the more information-rich prompts.</p><p>It is beyond the scope of this paper to disentangle the influence on prediction outcomes of information variety from that of the specific information types themselves. However, there are preliminary indications of complex interactions. For example, as shown in Table <ref type="table" target="#tab_3">2</ref>, prompt 7 (verbalisation strategy A), which adds the release date information, shows a large spike in performance over the worst performing prompt 6 when applied to RoBERTa large. Analysis of a verbalisation strategy A prompt enriched by release date alone explains a large portion of the improvement (recall@1 = 0.167, recall@5 = 0.48). However, the overall context provided by prompt 7 yields the best performance overall: a one-tailed dependent t-test between prompt 7 and the case of enrichment by release date alone demonstrates significant non-zero differences in the direction of greater prompt 7 performance for each of recall@1 and recall@5, with both tests reporting a p-value close to 0 at the 0.05 significance level. Accordingly, the results suggest that further work is required to better understand the interactive effect of information enrichment for whatever model, domain, and task such enriched prompts may be applied to.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Given that PLMs are limited in their performance on domain-specific cloze-style question answering prompts, in this paper we examined how adding context from KGs to naive prompts can improve the performance of PLMs on a movie genre prediction task. Through our experiments, we show a statistically significant improvement in recall for prompts enriched with information from the Wikidata KG, in comparison to non-enriched prompts, on the BERT and RoBERTa large PLMs.</p><p>As future work, we plan to expand our study to more domains, such as books and music, to better understand domain-specific optimal characteristics for enrichment and to cover the same domains as similar previous work <ref type="bibr" target="#b12">[13]</ref>. Additionally, we plan to enrich prompts using web entities <ref type="bibr" target="#b21">[22]</ref>. These entities are embedded in HTML pages using Microformats, Microdata and RDFa, and are extracted from the Common Crawl web corpus, the largest and most up-to-date web corpus available to the public. As more and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes, such engineered prompts could cover domain-specific knowledge that is not present in the encyclopedic Wikidata.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Proposed framework (a) typical querying setup for a Masked Language Model prediction. (b) Proposed approach to enrich the query using external knowledge.</figDesc><graphic coords="4,115.54,212.52,375.03,183.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Recall@n scores for BERT and RoBERTa large on the movie data subset, averaged over all movies. Verbalisation strategy A and B prompts consist of comma-separated and 'and'-separated Wikidata information, respectively, as described in Section 3.2. The greatest recall@n scores are highlighted in bold.</figDesc><table><row><cell>5</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Soft prompts are learnt at the embedding level.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>We would like to thank the International Semantic Web Summer School 2022, which initiated the collaboration between the authors of this paper. This work was funded in part by: the 'Culturally Aware AI' project funded by NWO; the ANR-19-CE23-0014 DeKaloG project (CE23, Intelligence artificielle); the CominLabs MiKroloG project; and Samsung Research UK. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101004746.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Wikidata property | Property Label | Enrichment Text
wdt:P161 | cast member | starring
wdt:P57 | director | directed by
wdt:P162 | producer | produced by
wdt:P58 | screenwriter | screenwriter
wdt:P86 | composer | music by
wdt:P1040 | film editor | edited by
wdt:P577 | year | released
wdt:P750 | distributed by | distributed by
wdt:P495 | country of origin | originating from</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>ArXiv abs/1810.04805</idno>
		<title level="m">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Observing lod using equivalent set graphs: it is mostly flat and sparsely linked</title>
		<author>
			<persName><forename type="first">L</forename><surname>Asprino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Beek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ciancarini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">V</forename><surname>Harmelen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="57" to="74" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Adaptive prompt learning-based few-shot sentiment analysis</title>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<idno>ArXiv abs/2205.07220</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Building a personalized dialogue system with prompt-tuning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kasahara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kawahara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shinzato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sato</surname></persName>
		</author>
		<idno>ArXiv abs/2206.05399</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Enhancing cross-lingual natural language inference by promptlearning from cross-lingual templates</title>
		<author>
			<persName><forename type="first">K</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1910" to="1923" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2225" to="2240" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Heinzerling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</author>
		<idno>ArXiv abs/2008.09036</idno>
		<title level="m">Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Transfer learning in natural language processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Swayamdipta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Tutorials</title>
				<meeting>the 2019 conference of the North American chapter of the association for computational linguistics: Tutorials</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="15" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hayashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.13586</idno>
		<title level="m">Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Mickus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Paperno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Van Deemter</surname></persName>
		</author>
		<idno>ArXiv abs/1911.05758</idno>
		<title level="m">What do you mean, bert? assessing bert as a distributional semantics model</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">How can we know what language models know?</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Araki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1911.12543</idno>
		<ptr target="https://arxiv.org/abs/1911.12543.doi:10.48550/ARXIV.1911.12543" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">What does bert know about books, movies and music? probing bert for conversational recommendation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Penha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Fourteenth ACM Conference on Recommender Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Eliciting knowledge from language models using automatically generated prompts</title>
		<author>
			<persName><forename type="first">T</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Razeghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L L</forename><surname>Iv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<idno>ArXiv abs/2010.15980</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Noisy channel language model prompting for few-shot text classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5316" to="5330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Exploiting cloze-questions for few-shot text classification and natural language inference</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</title>
				<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="255" to="269" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bartolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stenetorp</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08786</idno>
		<title level="m">Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>West</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hessel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Hwang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno>ArXiv abs/2110.07178</idno>
		<title level="m">Symbolic knowledge distillation: from general language models to commonsense models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Atomic: An atlas of machine commonsense for if-then reasoning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Le</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Allaway</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lourie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rashkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Roof</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI conference on artificial intelligence</title>
				<meeting>the AAAI conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="3027" to="3035" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Roberta: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>CoRR abs/1907.11692</idno>
		<ptr target="http://arxiv.org/abs/1907.11692.arXiv:1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The movielens datasets: History and context, Acm transactions on interactive intelligent systems</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Harper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Konstan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">tiis)</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="1" to="19" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Mühleisen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<title level="m">Web data commons-extracting structured data from two large web corpora</title>
				<imprint>
			<publisher>LDOW</publisher>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
