<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge Base Construction from Pre-trained Language Models by Prompt Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiao Ning</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remzi Celebi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Data Science, Faculty of Science and Engineering, Maastricht University</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Biological Science and Medical Engineering, Southeast University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Pre-trained language models (LMs) have advanced the state of the art for many semantic tasks and have also been proven effective for extracting knowledge from the models themselves. Although several works have explored the capability of LMs for constructing knowledge bases, including via prompt learning, this potential has not yet been fully explored. In this work, we propose a method of extracting factual knowledge from LMs for given subject-relation pairs and explore the most effective strategy to generate the missing object entities for each relation of triples. We design prompt templates for each relation using personal knowledge and the descriptive information available on the web, such as WikiData. Our LM probing approach is tested on the dataset provided by the International Semantic Web Conference (ISWC 2022) LM-KBC Challenge. To cope with the problem of varying performance for each relation, we designed a parameter selection strategy for each relation. Using the test dataset, we obtain an F1-score of 49.35%, which is higher than the baseline of 31.08%.</p>
      </abstract>
      <kwd-group>
        <kwd>Prompt learning</kwd>
        <kwd>Pre-trained language model</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Link Prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        tasks based on a gradient-guided search [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Prompt learning does not require a large amount of
labeled data or introduce a large number of additional parameters, which makes it a useful
analysis tool; it has been widely used in many domains, such as named entity recognition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
information extraction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and question answering [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Nevertheless, prompting requires manually
designing the context fed into the model, and designing efficient prompt templates directly
affects the performance of the model.
      </p>
      <p>
        In this work, we develop a system for track 1 of the LM-KBC challenge, a challenge that aims
to explore the viability of knowledge base construction from BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with low computational
requirements. We propose an automatic method to systematically improve the performance of
the prompts used to query relations from the pre-trained model. Our method is based on
bert-large-cased 1 due to existing studies demonstrating its outstanding performance. It is
based on mining or paraphrasing, and feeds one prompt at a time to the model. Considering that
different prompts may perform differently when used to query different relations, we
also combined answers from different prompts. The data, code and learned models
associated with this work can be accessed in the GitHub repository 2.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Prompt Generation</title>
      <p>
        We define prompt generation as the task of generating a set of prompts for each relation r,
where at least some of the prompts effectively trigger LMs to predict ground-truth object entities.
Our method is inspired by template-based relation extraction methods, which are based on
the observation that words in the vicinity of the subject s and object o in a large corpus often
describe the relation r. We obtained an alternative description for each relation from
the descriptive information in WikiData. Inspired by a template-based approach for relation
extraction, we created prompt templates based on different descriptive information combined
with professional knowledge. The three main methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] we used in this challenge are described below.
      </p>
      <p>Middle-word Prompt Based on the observation that words in the middle of the subject and
object are often indicative of the relation, we directly use those words as prompts. For example,
Sergey Brin set up Google is converted into a prompt s set up o by replacing the subject and object
with placeholders. For the CountryBordersWithCountry relation, we designed "{_}
shares border with {_}." as one prompt.</p>
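      <p>As a minimal illustration (not the authors' exact code), such a two-slot template can be instantiated by substituting the subject and masking the object slot before querying a masked LM such as bert-large-cased; the helper name fill_prompt below is hypothetical.</p>
      <preformat>
```python
def fill_prompt(template, subject, mask_token="[MASK]"):
    """Substitute the subject into the first "{_}" slot and mask the second.

    The masked string can then be scored with a fill-mask model,
    e.g. a HuggingFace pipeline over bert-large-cased.
    """
    with_subject = template.replace("{_}", subject, 1)   # fill the subject slot
    return with_subject.replace("{_}", mask_token, 1)    # mask the object slot

# Using the CountryBordersWithCountry template from the text:
prompt = fill_prompt("{_} shares border with {_}.", "France")
print(prompt)  # France shares border with [MASK].
```
      </preformat>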
      <p>
        Dependency-based Prompt In cases where template words do not appear in the middle,
templates based on syntactic analysis of the sentence can be more effective for relation extraction
tasks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For instance, the dependency path in The capital of China is Beijing gives a prompt of
capital of s is o. For the CompanyParentOrganization relation, we designed "The parent organisation
of {_} is {_}." as one prompt.
      </p>
      <p>Paraphrasing-based Generation To improve lexical diversity while remaining relatively
faithful to the original prompt, we paraphrased the original prompt with other semantically
similar or identical expressions. When the prompt is s shares a border with o, it may be
paraphrased as s borders with o or s is next to o. This is conceptually similar to the query expansion
techniques used in information retrieval to reformulate a given query to improve
retrieval performance [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Notes</title>
        <p>1. https://huggingface.co/bert-large-cased</p>
        <p>2. https://github.com/xiao-nx/LMKBC_2022</p>
      </sec>
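      <p>One way to organize the paraphrased templates is per relation; the variants below are the ones given in the text, while the structure and the names PARAPHRASES and expand_prompts are assumptions.</p>
      <preformat>
```python
# Hypothetical per-relation registry of paraphrased prompt templates.
# Each variant is tried as a separate prompt and scored on the training data.
PARAPHRASES = {
    "CountryBordersWithCountry": [
        "{_} shares border with {_}.",
        "{_} borders with {_}.",
        "{_} is next to {_}.",
    ],
}

def expand_prompts(relation, paraphrases=PARAPHRASES):
    """Return all paraphrased prompt templates registered for a relation."""
    return paraphrases.get(relation, [])
```
      </preformat>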
    </sec>
    <sec id="sec-3">
      <title>3. Prompt Selection and Ensemble</title>
      <p>In the previous section, we described methods of generating a set of candidate prompts
for a particular relation r. Each of these prompts may be more or less effective in eliciting
knowledge from the LMs, and thus it is necessary to decide how to use these generated prompts
during testing. In this section, we discuss the approaches explored for generating better
candidate objects by prompt-based link prediction. Our efforts here can be broadly classified
into two categories: using better prompts and ensembling prompts.</p>
      <sec id="sec-3-1">
        <title>3.1. Selection of the Top-k Prompts</title>
        <p>To find the prompts which better elicit the pre-trained model, we designed prompts
considering both a priori knowledge and synonyms as potential prompts. For each prompt, we
can measure its precision, recall and F1-score in predicting the ground-truth objects on the
training data, and keep several of the top-performing prompts based on F1-score.</p>
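        <p>A minimal sketch of this selection step, assuming predictions and ground truths are sets of object entities; the function names prf1 and select_top_prompts are hypothetical.</p>
        <preformat>
```python
def prf1(predicted, gold):
    """Precision, recall and F1 of a predicted object set vs. the ground truth."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted.intersection(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def select_top_prompts(prompt_f1, k=3):
    """Rank prompts by their training-set F1 and keep the k best."""
    ranked = sorted(prompt_f1, key=prompt_f1.get, reverse=True)
    return ranked[:k]

print(prf1({"Germany", "Spain"}, {"Germany", "Belgium"}))  # (0.5, 0.5, 0.5)
```
        </preformat>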
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Ensemble Prompts</title>
        <p>We do not observe the same scale of improvement with an increasing number of prompts involved;
in fact, most of the time the best F1 score is achieved with one prompt template. We argue that
this difference is due to the difference in evaluation metrics: we pay attention to the F1
scores rather than the macro-averaged accuracy scores, which give higher importance to the
precision of methods. Therefore, considering that having a variety of prompts may allow for
elicitation of knowledge that appeared in these different contexts, we rank all the prompts based
on their performance in predicting the objects in the training set and keep the prompts with an
F1-score higher than 0.1, up to the top 5. Treating the top-k prompts equally is sub-optimal, however,
as some prompts are more reliable than others.</p>
        <p>For every relation in the dataset, we use all filtered prompts to query the pre-trained language
model, and every prompt returns a set of object entities. It is then important to select the
most accurate object entities. Here, we developed an algorithm that jointly considers the
frequency and probability of each predicted object entity, and finally keeps the top 5 candidates.
Note that the top predicted objects often contain pronouns, such as him, them, it, determiners,
such as the, a, any, or other symbols, such as ?, 1970s, -s, so we removed these words.
In addition, we mapped music in the predicted results to producer, acting to actor,
teacher to professor, and water to hydrogen.</p>
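        <p>The paper does not spell out the exact aggregation rule; the sketch below scores each candidate by its summed probability across prompts (so both frequency and per-prompt confidence count), applies the word filtering and remapping described above, and keeps the top 5. All names here are assumptions.</p>
        <preformat>
```python
from collections import defaultdict

STOPWORDS = {"him", "them", "it", "the", "a", "any"}      # pronouns and determiners
REMAP = {"music": "producer", "acting": "actor",
         "teacher": "professor", "water": "hydrogen"}     # mappings from the text

def aggregate_candidates(per_prompt_predictions, top_n=5):
    """Merge (token, probability) predictions from several prompts.

    Summing probabilities rewards tokens that are both frequent across
    prompts and predicted with high confidence by individual prompts.
    """
    scores = defaultdict(float)
    for predictions in per_prompt_predictions:
        for token, prob in predictions:
            token = REMAP.get(token, token)
            if token.lower() in STOPWORDS:
                continue                                  # drop pronouns/determiners
            scores[token] += prob
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```
        </preformat>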
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The dataset for this challenge is divided into training data, development data and test data,
each covering a different set of subject-entities, along with a complete list of ground-truth
object-entities per subject-relation pair. The training subject-relation-object triples can be
used for training or probing the language models in any form, while the development data can be used
for hyper-parameter tuning, and the test data is used to measure the performance of the final
submitted system. Our proposed method is free from fine-tuning, so we only used the training
data to test the performance of the system and adjust parameters manually, then submitted the
developed system.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Settings</title>
        <p>Single Prompt Experiments For each prompt we designed, its corresponding performance
was tested on the training set. The performance of the top 3 prompts is shown in Table 1.</p>
        <p>Ensemble Prompts Experiments For some relations with low recall, we combined several
prompts and ranked their outputs as the final results to obtain more object entities. We labeled the top
3 prompts as prompt1, prompt2, prompt3, and evaluated the performance of the
ensembles [prompt1, prompt2, prompt3], [prompt1, prompt2], [prompt1, prompt3], [prompt2, prompt3],
then took the best performing combination on the training data.</p>
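        <p>The combination search can be sketched as follows; the score callable (mapping a prompt combination to its training-set F1) and the interface are assumptions, not the authors' code.</p>
        <preformat>
```python
CANDIDATE_COMBINATIONS = {
    "1+2+3": ["prompt1", "prompt2", "prompt3"],
    "1+2":   ["prompt1", "prompt2"],
    "1+3":   ["prompt1", "prompt3"],
    "2+3":   ["prompt2", "prompt3"],
}

def best_combination(combinations, score):
    """Return the name of the prompt combination with the highest training F1.

    score: callable mapping a list of prompt ids to an F1 value, assumed
    to run the ensemble against the training data.
    """
    return max(combinations, key=lambda name: score(combinations[name]))
```
        </preformat>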
        <p>Search Threshold Experiments Another observation is that the threshold strongly affects
the recall of the prediction results, and it is possible to obtain more object entities by lowering
the threshold. Thus, we searched various thresholds to optimize the F1 scores, and selected the
best thresholds based on the training data. According to the formula of the F1 score, it is known
that the F1 score achieves its maximum value when precision and recall are close to each
other, so we adjusted the threshold to search for the optimal F1 score.
In our experiments we performed only a small range of searching, but in order to
show the effect of the threshold on the F1 score clearly, we searched thresholds between 0.01 and
0.99 in steps of 0.01 and plotted the results in Figure 1.</p>
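        <p>The sweep described above can be sketched as follows, assuming each candidate object comes with a model probability; the function name and interface are hypothetical.</p>
        <preformat>
```python
def search_threshold(scored_candidates, gold, step=0.01):
    """Sweep thresholds from 0.01 to 0.99 and return the F1-optimal one.

    scored_candidates: list of (object_entity, probability) pairs;
    gold: set of ground-truth objects for the subject-relation pair.
    """
    best_t, best_f1 = 0.0, -1.0
    for i in range(1, 100):
        t = i * step
        predicted = {o for o, p in scored_candidates if p >= t}
        tp = len(predicted.intersection(gold))
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```
        </preformat>
        <p>In practice the sweep would be run per relation on the training data, since the best threshold varies by relation.</p>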
        <p>System Tool In this section, we present the prompt or the combination of prompts used for
each relation and the corresponding threshold value, as shown in Table 2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Final Test Results</title>
        <p>As for the models to probe, in our main experiments we use the BERT-large model. We use
three metrics to evaluate the success of the prompts in probing LMs: precision, recall, and
F1-score. The final performance of our proposed method on the test data can be seen in Table 3,
as recorded on CodaLab 3.</p>
        <sec id="sec-4-3-1">
          <title>Notes</title>
          <p>3. https://codalab.lisn.upsaclay.fr/competitions/5815</p>
          <p>s denotes {_}, o denotes {_}.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Prompt learning exploits the powerful capability of pre-trained language models, and
significantly reduces the dependence on supervised data. Prompt learning enables few-shot learning and
even zero-shot learning, which is promising for NLP downstream tasks, especially
information extraction. In this paper, we have applied different prompting techniques to extract
factual knowledge from pre-trained language models. We also designed various templates to
generate diverse prompts to query specific pieces of relational knowledge. Experiments show
that LMs are more reliable knowledge sources than initially indicated by previous results,
but they are also quite sensitive to the way we query them. We have made significant improvements
over the baseline method by generating more effective prompts, ensembling prompts
and searching different thresholds. It is promising to improve the accuracy of factual knowledge
retrieval through prompt design strategies for each relation. However, how to create a prompt, how
to select the language model, how to construct answer candidates, how to map answers to final
outputs, and how to find an optimal configuration for downstream tasks are still under
exploration.</p>
      <p>[Figure 1: per-relation threshold curves for ChemicalCompoundElement, CompanyParentOrganization,
CountryBordersWithCountry, CountryOfficialLanguage, PersonCauseOfDeath, PersonEmployer,
PersonInstrument, PersonLanguage, PersonPlaceOfDeath, PersonProfession, RiverBasinsCountry,
and StateSharesBorderState; y-axis from 0.1 to 1.0.]</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>Thanks to Shuai Wang, an excellent software engineer from Amazon, who introduced several
practical scripts to automate running the code, which significantly increased the efficiency
of experiments. Furthermore, the experimental part of this research was made possible, in part,
using the Data Science Research Infrastructure (DSRI) hosted at Maastricht University.</p>
      <p>[Table residue: per-relation rows for ChemicalCompoundElement, CompanyParentOrganization,
CountryBordersWithCountry, CountryOfficialLanguage, PersonCauseOfDeath, PersonEmployer,
PersonInstrument, PersonLanguage, PersonPlaceOfDeath, PersonProfession, RiverBasinsCountry,
StateSharesBorderState, and Average; values not recovered.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing</article-title>
          ,
          <source>arXiv preprint arXiv:2107.13586</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Making pre-trained language models better few-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2012.15723</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Araki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>How can we know what language models know?</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics 8</source>
          (
          <year>2020</year>
          )
          <fpage>423</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Razeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Logan IV</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Autoprompt: Eliciting knowledge from language models with automatically generated prompts</article-title>
          ,
          <source>arXiv preprint arXiv:2010.15980</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Template-based named entity recognition using bart</article-title>
          ,
          <source>arXiv preprint arXiv:2106.01760</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unified structure generation for universal information extraction</article-title>
          ,
          <source>arXiv preprint arXiv:2203.12277</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gribovskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Stokowiec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Grigorev</surname>
          </string-name>
          ,
          <article-title>Internet-augmented language models through few-shot prompting for open-domain question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2203.05115</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pantel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          ,
          <article-title>Representing text for joint embedding of text and knowledge bases</article-title>
          ,
          <source>in: Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1499</fpage>
          -
          <lpage>1509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <article-title>A survey of automatic query expansion in information retrieval</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 44</source>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>