Improving Language Model Predictions via Prompts Enriched with Knowledge Graphs⋆ Ryan Brate1 , Minh-Hoang Dang2 , Fabian Hoppe3,4 , Yuan He5 , Albert Meroño-Peñuela6 and Vijay Sadashivaiah7 1 KNAW Humanities Cluster, Digital Humanities Lab, Amsterdam, Netherlands 2 LS2N, Université de Nantes, Faculté des Sciences et Techniques (FST), France 3 FIZ Karlsruhe, Leibniz Institute for Information Infrastructure, Germany 4 Karlsruhe Institute of Technology, Institute AIFB, Germany 5 University of Oxford, UK 6 King’s College London, UK 7 Rensselaer Polytechnic Institute, USA Abstract Despite advances in deep learning and knowledge graphs (KGs), using language models for natural language understanding and question answering remains a challenging task. Pre-trained language models (PLMs) have shown to be able to leverage contextual information, to complete cloze prompts, next sentence completion and question answering tasks in various domains. Unlike structured data querying in e.g. KGs, mapping an input question to data that may or may not be stored by the language model is not a simple task. Recent studies have highlighted the improvements that can be made to the quality of information retrieved from PLMs by adding auxiliary data to otherwise naive prompts. In this paper, we explore the effects of enriching prompts with additional contextual information leveraged from the Wikidata KG on language model performance. Specifically, we compare the performance of naive vs. KG-engineered cloze prompts for entity genre classification in the movie domain. Selecting a broad range of commonly available Wikidata properties, we show that enrichment of cloze-style prompts with Wikidata information can result in a significantly higher recall for the investigated BERT and RoBERTa large PLMs. However, it is also apparent that the optimum level of data enrichment differs between models. Keywords Prompt Learning, Pre-trained Language Model, Knowledge Graph. 1. Introduction Pre-trained language models (PLMs) [1, 2], based on deep learning attention-based architectures, have shown to have outstanding performance at various natural language processing (NLP) tasks predicated on natural language understanding. However, the extent to which they capture domain knowledge and empirical semantics [3] — i.e. the use of formal domain properties Workshop on Deep Learning for Knowledge Graphs (DL4KG@ISWC2022), October 23-24, 2022 ⋆ Authors listed in alphabetical order $ r.brate@gmail.com (R. Brate); minhhoangdang@hotmail.com (M. Dang); fabian.hoppe@kit.edu (F. Hoppe); yuan.he@cs.ox.ac.uk (Y. He); albert.merono@kcl.ac.uk (A. Meroño-Peñuela); sadasv2@rpi.edu (V. Sadashivaiah)  0000-0002-7047-2770 (F. Hoppe); 0000-0002-4486-1262 (Y. He); 0000-0003-3375-3810 (V. Sadashivaiah) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) in practice — is not well understood. In this work, we narrow down the focus to cloze-style completion, the task of predicting the masked entity text in a sentence. For example, given: “The Klingons are a species in the franchise [MASK]”, the PLM is expected to predict “Star Trek” for [MASK]. It aims to extract the implicit knowledge entailed by the PLMs, since such knowledge can be used for downstream NLP applications like sentiment analysis [4], dialogue systems [5], and natural language inference [6], as well as for completing the missing information of knowledge graphs (KGs) or ontologies [7], and even constructing new ones [8]. In recent years, PLMs have improved on the state of the art in many NLP tasks by leveraging large text corpora [9], but most of time they still require annotated data for task-specific fine-tuning [10]. However, the empirical semantics gathered by these models is limited to distributional aspects [11]. Therefore, the performance, especially in the few- and zero-shot setting, highly depends on the provided prompt, i.e. snippets of contextual information for a specific task. However, in many cases the engineering of the prompts is naive and simplistic, giving the PLM too little context to provide an accurate answer, and unsystematic, providing little principles on how exactly these prompts need to be composed in order to have a predictable behaviour. Indeed, recent studies [12] have highlighted the improvements that can be made to the quality of information retrieved from PLMs by performing amendments to these prompts. This casts doubts on some studies [13] that claim that a PLM cannot answer easy questions about e.g. culture (movies, books, music, ...), it is reasonable to postulate that PLMs could perhaps answer those questions accurately if they were provided with systematically engineered prompts that contained richer contexts. Existing approaches of prompt engineering include: (i) learn-by-example, where the prompt consists of the concatenation of correct examples we expect a PLM to predict [2]; (ii) manually designed prompts of different granularities [13]; (iii) automatically searched prompts optimized on few-shot samples [14], all of which rely on the implicit semantics of natural language texts. In this paper, we investigate how incorporating explicit knowledge from external sources like KGs can help prompt engineering and thus enhance the cloze-style question answering of PLMs. Specifically, we explore cloze-style prompts with respect to the movie domain in respect of the performance of the BERT and RoBERTa large PLMs. 2. Related Work Studies towards prompt learning are based on the hypothesis that pre-trained language models (PLMs) have learnt abundant knowledge and just require sufficiently detailed contexts for predictions [2, 10, 15] — and in this way, it is possible to apply PLMs without data-driven fine-tuning. A (hard1 ) prompt is the conditioning text which is combinded with the input to provide contexts or hints for the PLM. A template (i.e. pattern) is a function that integrates the inputs and prompts. Answers are then given by the PLMs conditioned on the prompts, and a further function (i.e. verbalizer) is often required to map the answers to the final outputs. The reason for that is, the prompt learning paradigm is typically formulated as a similar task to the PLM’s pre-training task, which does not necessarily yield the desired outputs of downstream applications. 1 Soft prompts are learnt at the embedding level. An important part of prompt learning is prompt engineering, i.e., to design template(s), either manually or automatically, to support downstream applications. In [2], Brown et al. proposed to use demonstrations, i.e., a sequence of input-output texts, as the prompts, expecting that the PLM can implicitly learn to predict from examples. For instance, if we want the PLM to predict the masked position in “[MASK] is the capital of China.”, we can demonstrate by appending “London is the capital of the UK” after the masked sentence. Schick et al. [16] manually designed different templates, each corresponding to an individual PLM trained on few-shot examples. The predictions of downstream text classification and natural language inference tasks were then made according to an ensemble of trained PLMs. Shin et al. [14] argued that manually designed templates suffer from the uncertainty of guesswork or the lack of domain expertise. Therefore, they proposed to search for templates using gradient-based optimization. More recently, Lu et al. [17] have shown that PLMs performance varies with the order of these prompts, and use generative language models and entropy statistics on the prompt permutations to identify prompts with good performance. KGs or ontologies are excellent sources for providing explicit knowledge to enrich prompts or verbalizers. West et al. [18] considered distilling a student model in the common sense domain from the enormously large PLM GPT-3 [2], which serves as the teacher model. They adopted the prompt learning scheme to extract triples from the teacher model with templates created and examples extracted from the common sense KG Atomic [19]. Hu et al. [7] argued that the label word space (i.e., the answer space) can be well expanded by adding in external knowledge about related words. They employed different refinement heuristics to shortlist candidates to benefit the downstream classification task. For instance, if some “Person” is classified as a “Physicist” in the ground truth data, then answers like “Scientist” will also be accepted. Our work was motivated by the probing study of Penha et al. [13] that investigates whether BERT (a well-known PLM consisting of stacked transformer encoders [1]) actually knows superficial cultural knowledge about books, movies, and music. Cloze-style questions for classifying the genre of entities (from Wikidata) of different books, movies, and music were given for the PLM to answer, often with unsatisfying performance. However, their work considered naive prompts without sufficient contexts, while ours attempts to examine if KGs can enrich these prompts, especially giving additional contexts (e.g., attributes, 𝑘-hop neighbours) of the entities in order to help the PLM to generate better predictions. 3. Methodology The basic idea of our method is to use the information about entities in KGs to expand cloze-style prompts with richer entity descriptions. It is summarized in Figure 1. We enrich the naive prompt, for example Die Hard is of genre [MASK], through matching the movie Die Hard to the corresponding Wikidata item and extract auxiliary knowledge with SPARQL queries, and generating an enriched prompt using this auxiliary data. We use datatype properties and verbalize entities using rdfs:label to compose valid phrases. As a result, we obtain e.g. Die Hard is a movie, starring Bruce Willis, directed by John McTier- nan, of the genre [MASK]. We then use both (a) the naive prompts and (b) the KG-enriched prompts to query various language models, and compare their performance on the entity genre classification task. In the following paragraphs the enrichment by KG querying and the prompt engineering step are described in detail. (a) (b) Figure 1: Proposed framework (a) typical querying setup for a Masked Language Model prediction. (b) Proposed approach to enrich the query using external language. 3.1. Knowledge Graph Querying The auxiliary data for each movie is extracted from Wikidata. This is done in a simplistic two-step-process using SPARQL queries. The queries operate on a batch of input records to reduce the number of requests and avoid timeout errors. First, the movies are linked to their respected Wikidata entities by IMDb or TMDB ID utilizing the Wikidata properties IMDb ID (wdt:P345) and TMDb movie ID (wdt:P4947). If this does not provide an entity, an exact string matching given the title is attempted as well. SELECT ?mlId ?imdbId ?tmdbId ?movie WHERE { VALUES (?mlId ?imdbId ?tmdbId) {("1" "tt0114709" "862" ) ... } {?movie wdt:P345 ?imdbId . } UNION {?movie wdt:P4947 ?tmdbId .} } Listing 1: SPARQL query used for entity linking with the IMDb or TMDB ID. The second step queries the entities for the auxiliary data used to enrich the prompts with additional contextual information. Overall, a set of 28 properties was extracted and investigated for each entity. A simplified version of the utilized SPARQL query is given in 2. This query can easily be adapted to query other properties by adding these properties to the ?property values. From this set of properties a subset of 10 manually selected domain-specific properties are used to constract the enriched prompts. The properties are selected based on human intuition and the most frequent co-occurrence for the given entities. SELECT ?mlId (SAMPLE(?movieLabel) AS ?movieLabel) (SAMPLE(?propertyLabel) AS ?propertyLabel) (GROUP_CONCAT(DISTINCT ?objectLabel; SEPARATOR=", ") AS ?objectList) WHERE{ VALUES (?mlId ?movie) { ("1" ) ...} ?movie rdfs:label ?movieLabel . FILTER (LANG(?movieLabel)="en") VALUES ?property {wdt:P144 wdt:P179 ...} ?p1 wikibase:directClaim ?property . ?p1 rdfs:label ?propertyLabel . FILTER (LANG(?propertyLabel)="en") OPTIONAL { ?movie ?property ?object . ?object rdfs:label ?objectLabel . FILTER (LANG(?objectLabel)="en") } } GROUP BY ?mlId ?property Listing 2: Simplified SPARQL query used to retrieve additional movie knowledge from Wikidata. 3.2. Prompt Engineering Similarly to [13], we consider an entity genre classification task. The prompts are of the form: “ is a movie <Wikidata enrichment>, of the genre [MASK].”, where <Wikidata enrichment> is an aggregation of movie properties and corresponding values extracted from Wikidata pertaining the title in question, in some natural language format. Table 1 lists the Wikidata properties used to assemble values for <Wikidata enrichment>. Wikidata property Property Label Enrichment Text wdt:P161 cast member starring wdt:P57 director directed by wdt:P162 producer produced by wdt:P58 screenwriter screenwriter wdt:P86 composer music by wdt:P1040 film editor edited by wdt:P577 year released wdt:P750 distributed by distributed by wdt:P495 country of origin originating from Table 1 Wikidata properties used in constructing probes for the movie dataset. ’Enrichment Text’ is the text adopted in the probe enrichment to describe the property in question in a more natural language format. The Wikidata properties listed in Table 1 are broadly ranked in descending information specificity. It was in this order, that ten variations for a probe were constructed, by sequentially adding Wikidata properties to prompts, building gradually more contextual-information dense prompts. In adding property information, only the first value of each Wikidata property was used where more than one was available (e.g., the first listed cast member). E.g., as follows; the unenriched prompt, the first two successive prompt enrichments, and the final enriched form pertaining to the movie Die Hard.: • non-enriched prompt: Die Hard is a movie, of the genre [MASK]. • enriched Prompt 1(A): Die Hard is a movie, starring Bruce Willis, of the genre [MASK]. • enriched Prompt 2(A): Die Hard is a movie, starring Bruce Willis, directed by John McTier- nan, of the genre [MASK]. • enriched Prompt 9(A): Die Hard is a movie, starring Bruce Willis, directed by John Mc- Tiernan, produced by Joel Silver, screenwriter Roderick Thorp, music by Michael Kamen, edited by John F. Link, released 1988, distributed by Netflix, originating from United States of America, of the genre [MASK]. Given the potential for sensitivity of PLMs to the verbalisation strategy used in the construction of cloze-stype prompts, we considered two verbalisation strategies for aggregation of the additional Wikidata properties. Whereas the above verbalisation strategy A form is aggregated with commas, the verbalisation strategy B form is aggregated with and tokens. E.g.: • enriched Prompt 1(B): Die Hard is a movie and starring Bruce Willis, of the genre [MASK]. • enriched Prompt 2(B): Die Hard is a movie and starring Bruce Willis and directed by John McTiernan, of the genre [MASK]. • enriched Prompt 9(B): Die Hard is a movie and starring Bruce Willis and directed by John McTiernan and produced by Joel Silver and screenwriter Roderick Thorp and music by Michael Kamen and edited by John F. Link and released 1988 and distributed by Netflix and originating from United States of America, of the genre [MASK]. Thus, in total 19 prompt variations are considered for each movie. 4. Evaluation 4.1. Dataset In order to test our approach, we use the BERT [1] and RoBERTa large [20] pre-trained models. The test dataset we are using is a subset of ML25M from IMDB [21]. ML25M contains title and ground truth genre classification of a range of 54,758 movies. A subset of this dataset was then assembled, as those movies for which the Wikidata properties as listed in Table 1 were present in full. This resulted in a test set of 9,596 movie titles. The Wikidata properties, and thus the corresponding data subset, were selected as a compromise between a large dataset, and a diverse set of domain-relevant Wikidata properties, following exploratory analysis of the ML25M dataset. 4.2. Results Table 2 lists the recall@n scores for each of the prompts described in Section 3.2, for the BERT and RoBERTa large models respectively. For a given model and prompt, recall@1 and recall@5 values for each movie are calculated as the fraction of movie ground-truth genres predicted in the highest ranked n PLM mask predictions. The aggregated recall@n values reported in Table 2 are the micro-averaged recall@n scores across all movies in the test dataset, with respect to the model and prompt referenced. With reference to Table 2, certain variations of the enriched probes showed greater R@n scores that the non-enriched case, for both the BERT and RoBERTa large models, across ver- balisation strategies. We compare the statistical significance of the R@n outcomes, of the highest performing enriched prompts (bolded) against the non-enriched case, via a one-tailed, directional, dependent t-test. Where the null hypothesis is that average of the R@n differences is 0, and the alternative hypothesis is that the average of the R@n differences is non-zero, biased towards the selected enriched probe. A significance of 0.05 is applied. With reference to the p-values given in Table 3, we can affirm with statistical significance that the enriched prompts are more performant overall. BERT RoBERTa large Prompt Verbalisation Verbalisation Verbalisation Verbalisation Strategy A Strategy B Strategy A Strategy B R@1 R@5 R@1 R@5 R@1 R@5 R@1 R@5 non-enriched 0.136 0.448 0.136 0.448 0.065 0.198 0.065 0.198 1 0.139 0.476 0.153 0.487 0.096 0.264 0.076 0.204 2 0.161 0.515 0.161 0.498 0.114 0.297 0.016 0.068 3 0.092 0.423 0.065 0.305 0.057 0.180 0.005 0.031 4 0.024 0.226 0.036 0.258 0.010 0.100 0.004 0.038 5 0.017 0.176 0.011 0.090 0.034 0.115 0.012 0.043 6 0.004 0.104 0.020 0.062 0.013 0.053 0.012 0.044 7 0.055 0.320 0.021 0.214 0.191 0.556 0.203 0.534 8 0.047 0.250 0.006 0.065 0.163 0.466 0.142 0.389 9 0.020 0.140 0.014 0.083 0.184 0.536 0.210 0.576 Table 2 Recall@n scores for the Bert, RoBERTa large and the movie data subset, averaged over all movies. Verbalisation strategy A and B prompts consist of comma separated and ’and’ separated WikiData information, respectively, as described in Section 3.2. The greatest recall@n scores are highlighted in bold. 4.3. Discussion The results and analysis of Section 4.2 give support to the position that, when considered en- masse, enrichment of prompts with domain-relevant information from Wikidata can improve cloze-style genre prediction in the movie domain. This is the case for both of the investigated verbalisation strategies. mean test p-values difference statistic Verbalisation Prompt 2 R@1 0.0245 8.33 0* Strategy A Prompt 2 R@5 0.0672 21.0 0* BERT Verbalisation Prompt 2 R@1 0.0252 8.47 0* Strategy B Prompt 2 R@5 0.0506 15.2 0* Verbalisation Prompt 7 R@1 0.125 43.0 0* RoBERTa Strategy A Prompt 7 R@5 0.358 86.5 0* large Verbalisation Prompt 9 R@1 0.144 47.7 0* Strategy B Prompt 9 R@5 0.378 92.5 0* Note: * denotes that the p-value is 0 to at least 3 significant figures. Table 3 Results for separate dependent one-tailed t-tests under the alternative hypothesis that the average difference between the enriched and non-enriched prompts is non-zero in favour of the enriched case. A p–value less than 0.05 means that we accept the alternative hypothesis with a 5% chance of Type I error. It is noteworthy, however, that the BERT and RoBERTa large models behave very differently in terms of both their non-enriched performance and their performance when subject to varying levels of enrichment. This is demonstrative of the potential for PLM improvement via prompt enrichment as being highly specific to the model in question. BERT demonstrates optimum recall performance in aggregate for those enriched prompts with relatively low levels of information enrichment, followed by a very rapid reduction in recall@n for further enriched prompts. Whereas RoBERTa large demonstrates fluctuating performance relative to the non-enriched prompt, with the greatest performance shown in the more information-rich prompts. It is beyond the scope of this paper to disentangle the role of information variety and the specific information types themselves, as to the influence on prediction outcomes. However, there are preliminary indications of complex interactions. For example, as shown in Table 2, prompt 7 (verbalisation strategy A) applied to RoBERTa large shows a huge spike in improved performance over the worst performing prompt 6, which adds the release date information. Analysis of a verbalisation strategy A prompt enriched only by release date alone, explains a large portion of the improvement (recall@1 = 0.167, recall@5 = 0.48). However, the overall context provided by prompt 7 results in the best performance overall: A one-tail dependent t test between prompt 7 and the case of enrichment by only release date, demonstrates significant non-zero differences, in the direction of greater prompt 7 performance for each of recall@1 and recall@5. Both tests reporting a p–value close to 0, with respect to a 0.05 significance. Accordingly, the results are suggestive of further investigative work being required to understand better the interactive effect of information enrichment on whatever model, domain, and task to which such enriched prompts may be applied. 5. Conclusion Given that PLMs are limited in performance for domain-specific cloze-style question answering prompts, in this paper we examine how adding additional context to naive prompts from KGs can improve the performance of PLMs on a movie genre prediction task. Through our experiments, we show a statistically significant improvement in recall on prompts enriched with information from the Wikidata KG in comparison to non-enriched prompts on the BERT and RoBERTa large PLMs. As future work, we plan to expand our study to include more domains such as books, music etc. to better understand domain-specific optimum characteristics for enrichment, and cover the same domains as similar previous work [13]. Additionally, we look forward to enriching prompts using web entities [22]. These entities are embedded in HTML pages on the web using Microformat, Microdata and RDFa from the Common Crawl web corpus, the largest and most up-to-date data web corpus available to the public. As more and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes, the engineered prompts covered domain-specific knowledge that is not present in the encyclopedic Wikidata. Acknowledgements We would like to thank the International Semantic Web Summer School 2022, which initiated the collaboration between the authors in producing this paper. This work was funded in-part by: ‘Culturally Aware AI’ funded by NWO, the ANR-19-CE23-0014 DeKaloG project (CE23 - Intelligence artificielle) and the CominLabs MiKroloG project, Samsung Research UK. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004746. References [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, ArXiv abs/1810.04805 (2019). [2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. [3] L. Asprino, W. Beek, P. Ciancarini, F. v. Harmelen, V. Presutti, Observing lod using equivalent set graphs: it is mostly flat and sparsely linked, in: International Semantic Web Conference, Springer, 2019, pp. 57–74. [4] P. Zhang, T. Chai, Y. Xu, Adaptive prompt learning-based few-shot sentiment analysis, ArXiv abs/2205.07220 (2022). [5] T. Kasahara, D. Kawahara, N. Tung, S. Li, K. Shinzato, T. Sato, Building a personalized dialogue system with prompt-tuning, ArXiv abs/2206.05399 (2022). [6] K. Qi, H. Wan, J. Du, H. Chen, Enhancing cross-lingual natural language inference by prompt- learning from cross-lingual templates, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 1910–1923. [7] S. Hu, N. Ding, H. Wang, Z. Liu, J. Wang, J. Li, W. Wu, M. Sun, Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2225–2240. [8] B. Heinzerling, K. Inui, Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries, ArXiv abs/2008.09036 (2021). [9] S. Ruder, M. E. Peters, S. Swayamdipta, T. Wolf, Transfer learning in natural language processing, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Tutorials, 2019, pp. 15–18. [10] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, arXiv preprint arXiv:2107.13586 (2021). [11] T. Mickus, D. Paperno, M. Constant, K. van Deemter, What do you mean, bert? assessing bert as a distributional semantics model, ArXiv abs/1911.05758 (2019). [12] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?, 2019. URL: https://arxiv.org/abs/1911.12543. doi:10.48550/ARXIV.1911.12543. [13] G. Penha, C. Hauff, What does bert know about books, movies and music? probing bert for conversational recommendation, Fourteenth ACM Conference on Recommender Systems (2020). [14] T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, S. Singh, Eliciting knowledge from language models using automatically generated prompts, ArXiv abs/2010.15980 (2020). [15] S. Min, M. Lewis, H. Hajishirzi, L. Zettlemoyer, Noisy channel language model prompting for few-shot text classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5316–5330. [16] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 255–269. [17] Y. Lu, M. Bartolo, A. Moore, S. Riedel, P. Stenetorp, Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity, arXiv preprint arXiv:2104.08786 (2021). [18] P. West, C. Bhagavatula, J. Hessel, J. D. Hwang, L. Jiang, R. L. Bras, X. Lu, S. Welleck, Y. Choi, Symbolic knowledge distillation: from general language models to commonsense models, ArXiv abs/2110.07178 (2021). [19] M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, Y. Choi, Atomic: An atlas of machine commonsense for if-then reasoning, in: Proceedings of the AAAI conference on artificial intelligence, volume 33, 2019, pp. 3027–3035. [20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692. [21] F. M. Harper, J. A. Konstan, The movielens datasets: History and context, Acm transactions on interactive intelligent systems (tiis) 5 (2015) 1–19. [22] H. Mühleisen, C. Bizer, Web data commons-extracting structured data from two large web corpora, in: LDOW, 2012.