<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Task-specific Pre-training and Prompt Decomposition for Knowledge Graph Population with Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jef Z. Pan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianyi Li</string-name>
          <email>tianyi.li@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenyu Huang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Papasarantopoulos</string-name>
          <email>nikos.papasarantopoulos@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavlos Vougiouklis</string-name>
          <email>pavlos.vougiouklis@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Huawei Edinburgh Research Centre</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ILCC, School of Informatics, University of Edinburgh</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a system for knowledge graph population with Language Models, evaluated on the Knowledge Base Construction from Pre-trained Language Models (LM-KBC) challenge at ISWC 2022. Our system involves task-specific pre-training to improve LM representation of the masked object tokens, prompt decomposition for progressive generation of candidate objects, among other methods for higher-quality retrieval. Our system is the winner of track 1 of the LM-KBC challenge, based on the BERT LM; it achieves a 55.0% F-1 score on the hidden test set of the challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Knowledge graph population is the task of predicting the objects for given subject-relation pairs. For example, for the subject-relation pair &lt;  , StateSharesBorderState &gt;, the task is to predict the appropriate objects such as Faro, Beja, Gibraltar, etc. The task of knowledge graph population is closely related to the task of link prediction in the knowledge graph and Natural Language Processing (NLP) literature [1, 2]; the key difference is that, in knowledge graph population, the objects are generated not from a fixed pool of entity nodes, but from an open vocabulary of words.</p>
      <p>Our code and data are available at https://github.com/Teddy-Li/LMKBC-Track1.</p>
      <p>
        Our system falls in track 1 of the challenge: it seeks to improve the BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] language
model’s performance in knowledge graph population from the following three dimensions: 1)
LM representation of masked object tokens; 2) candidate object generation; 3) candidate object
selection (ranking). For improving LM representations, we apply task-specific pre-training,
utilizing silver data retrieved from Wikidata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to aid the training process; for candidate
generation, we use prompt decomposition to convert complex knowledge graph population tasks
into multiple simpler tasks; for candidate selection, we use adaptive thresholds, together with
explorations of methods for relaxing the single-true-object assumption behind Softmax normalization.
      </p>
      <p>
        In comparison to the winning submission in track 2 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we highlight the following
contributions: 1) we show the effectiveness of task-specific pre-training, particularly when doing
so separately for each individual relation; 2) we propose to decompose prompts to split the
task into multiple steps, in order to achieve the best results under the constraints of LM size and
capability.
      </p>
      <p>Below, we discuss the above three dimensions of improvement in detail in Sections 2, 3 and 4,
then describe our main experimental results in Section 5.</p>
      <p>Our method is based on the BERT-large-cased LM (https://huggingface.co/bert-large-cased), since as a general observation we have found
cased BERT models to outperform uncased ones; we speculate that this can be attributed to
an explicit distinction between named entities and general nouns through the capitalization of the first
characters. For all supervised experiments, we split the train set further into a train2 and a dev2
set with respective portions of 80% and 20%. We use the train2 set for training and the dev2
set for checkpointing; this way, the dev set is kept as a hidden evaluation dataset, on which we
report results throughout Sections 2 to 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task-Specific Pre-training for Better Representations</title>
      <p>
        Language models like BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have been trained on diverse texts at a large scale; therefore,
by “reminding” the models of what type of information they are supposed to recall from
pre-training, their performance is expected to improve. Along this line, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have shown that adaptive
pre-training is helpful for language models’ performance on target domains. In our system,
we explore the approach of training the BERT language model under the MLM objective with the
subject-relation-object triples.
      </p>
      <p>The task-specific pre-training approach can be summarized as follows: given a
subject-relation-object triple, we use the triple to instantiate the corresponding prompt template, to
create a sentence. We then mask those tokens in the sentence that are relevant to the object
entity, and train the BERT models with the masked sentence, where the objective is to recover
these masked tokens.</p>
      <p>One interesting dimension of freedom in this task is which tokens to mask. This is motivated
by our end task, to predict the tokens in the place of objects. Therefore, the representation
of those object tokens is what we are most keen on improving. We further hypothesize that
improving the representation of tokens close to the object tokens (for instance, the tokens “a”
and “.” in the sentence “A cat sits on a mat .”) may also help with the prediction of object tokens.
Thus, in summary, we mask the tokens corresponding to the object,
as well as the tokens beside the object tokens, up to a chosen window size on each side.</p>
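      <p>Below is a minimal sketch of how such masked training examples might be built with the HuggingFace transformers tokenizer; the template string, the example triple, and the window-size default are illustrative assumptions rather than the exact implementation.</p>
      <preformat>
# Sketch: build an MLM training example that masks the object tokens plus a
# window of neighbouring tokens (template, triple and window size are assumptions).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

def build_masked_example(subject, obj, template, window=1):
    # Instantiate the prompt template with the subject and object of the triple.
    sentence = template.replace("[SUBJ]", subject).replace("[OBJ]", obj)
    enc = tokenizer(sentence, return_offsets_mapping=True)
    input_ids = enc["input_ids"]

    # Find the token positions covered by the object string.
    obj_start = sentence.index(obj)
    obj_end = obj_start + len(obj)
    obj_positions = [
        i for i, (s, e) in enumerate(enc["offset_mapping"])
        if s &lt; obj_end and e &gt; obj_start and e &gt; s  # (0, 0) offsets are special tokens
    ]

    # Extend the masked span by `window` tokens on each side, keeping [CLS]/[SEP] unmasked.
    lo = max(1, min(obj_positions) - window)
    hi = min(len(input_ids) - 2, max(obj_positions) + window)

    # Standard MLM convention: labels are -100 everywhere except the masked positions.
    labels = [-100] * len(input_ids)
    for i in range(lo, hi + 1):
        labels[i] = input_ids[i]
        input_ids[i] = tokenizer.mask_token_id
    return {"input_ids": input_ids, "labels": labels}

example = build_masked_example("Portugal", "Spain", "[SUBJ] shares border with [OBJ].")
print(tokenizer.decode(example["input_ids"]))
      </preformat>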
      <p>
        Another interesting dimension of freedom is what training data to use. As a baseline we
have the training set from the LM-KBC challenge to use for generating sentences; however,
the scale of the training set (i.e. 100 subjects per relation) is very small even for a fine-tuning
dataset. To mitigate this data sparsity issue, we further refer to Wikidata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more data
entries, collecting the set of subject-object pairs recorded in Wikidata as satisfying each relation.
Notably, to maintain the integrity of the evaluation, we exclude any overlap between the
subjects in any subset of the challenge and the subjects in the retrieved entity pairs. We exclude
entries by subject because: 1) it is more secure to exclude subject mentions with arbitrary objects
than to exclude subject-object pairs; 2) for the challenge test set only the subjects are available, so by excluding
overlaps by subjects, we ensure our models do not peek at the test set in any way.
      </p>
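      <p>A sketch of how such silver subject-object pairs might be retrieved is shown below; it queries the public Wikidata SPARQL endpoint and drops any subject that appears in the challenge data. The property ID (P47, “shares border with”), the result limit, and the toy subject list are illustrative assumptions.</p>
      <preformat>
# Sketch: collect silver subject-object pairs for one relation from the Wikidata
# SPARQL endpoint, excluding subjects that occur anywhere in the challenge data.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# P47 ("shares border with") is used here as an illustrative relation mapping.
QUERY = """
SELECT ?subjLabel ?objLabel WHERE {
  ?subj wdt:P47 ?obj .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10000
"""

def fetch_silver_pairs(challenge_subjects):
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "lmkbc-silver-data-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    pairs = []
    for row in resp.json()["results"]["bindings"]:
        subj = row["subjLabel"]["value"]
        obj = row["objLabel"]["value"]
        # Exclude by subject: any subject seen in a challenge split is dropped.
        if subj not in challenge_subjects:
            pairs.append((subj, obj))
    return pairs

# Toy usage with two subjects that must not leak into the pre-training data.
silver = fetch_silver_pairs(challenge_subjects={"Andalusia", "Hebei"})
print(len(silver), silver[:3])
      </preformat>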
      <p>As illustrated in Tables 1 and 2, we find mixed results from our initial experimentation with
task-specific pre-training: the performance dramatically improves for some relations, and
dramatically drops for others. The trend is consistent across different configurations of
intermediate pre-training, while the exact values differ slightly across different window sizes, with no
dominant configuration. This implies that, for knowledge graph population with BERT language
models, one size does not fit all; we need separately fine-tuned LM checkpoints for different
relations to achieve the best results. Adapting BERT to attend to different relations separately would
be impractical with only the challenge training set because of the small size of the training data;
however, with the much larger silver datasets retrieved from Wikidata, we are able to
elicit a family of BERT checkpoints, each dedicated to one or a few relations, where different
checkpoints are reminded of different types of factual knowledge. When jointly used for link
prediction, our family of BERT checkpoints exhibits superior performance over any single BERT
model, as shown in Table 5.</p>
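      <p>A sketch of how such a family of relation-specific checkpoints might be used jointly at prediction time follows; the checkpoint paths and the relation-to-checkpoint mapping are hypothetical, with the stock BERT-large-cased model as a fallback.</p>
      <preformat>
# Sketch: dispatch each query to the checkpoint pre-trained for its relation,
# falling back to the generic model. Checkpoint paths are hypothetical.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

CHECKPOINTS = {
    "StateSharesBorderState": "checkpoints/state_borders",        # hypothetical path
    "ChemicalCompoundElement": "checkpoints/chemical_elements",   # hypothetical path
    # ... one entry per relation (or group of relations)
}
DEFAULT = "bert-large-cased"

tokenizer = BertTokenizerFast.from_pretrained(DEFAULT)
_models = {}

def model_for(relation):
    # Lazily load and cache the checkpoint dedicated to this relation.
    path = CHECKPOINTS.get(relation, DEFAULT)
    if path not in _models:
        _models[path] = BertForMaskedLM.from_pretrained(path).eval()
    return _models[path]

def predict_object_tokens(subject, relation, template, top_k=100):
    model = model_for(relation)
    prompt = template.replace("[SUBJ]", subject).replace("[OBJ]", tokenizer.mask_token)
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    probs = logits[0, mask_pos[0]].softmax(dim=-1)
    scores, ids = probs.topk(top_k)
    return list(zip(tokenizer.convert_ids_to_tokens(ids.tolist()), scores.tolist()))

# Falls back to the generic checkpoint for relations without a dedicated model.
print(predict_object_tokens("Portugal", "CountryBordersWithCountry",
                            "[SUBJ] shares border with [OBJ].")[:5])
      </preformat>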
      <p>As an additional comment, we present the sizes of our additional MLM training data in Table
3, where we show that there is not a clear dependency between the sizes of the training sets
and the performance of task-specifically pre-trained checkpoints. This means that the discrepancy
in performance of this MLM training is not strongly related to data size, but rather to
the properties of the knowledge required for each relation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Prompt Decomposition for Improved Candidate Generation</title>
      <p>In this section, we discuss the approaches explored for generating better candidate objects by
prompt-based link prediction. Our efforts here can be broadly classified into two categories:
using better prompts and decomposing the prompts.</p>
      <sec id="sec-3-1">
        <title>3.1. Prompt Elicitation</title>
        <p>On the elicitation of better prompts, we experimented with both manual and automatic
approaches. For the relation PersonInstrument, in order to help BERT ground the names to the
corresponding musicians, we explicitly provide the entity type “musician” as part of the prompt:
“The musician [SUBJ] plays [OBJ], which is an instrument”. For the relation PersonEmployer,
we simplify the prompt into a concise sentence to the same effect: “[SUBJ] works at [OBJ]”.</p>
        <p>
          For automatic elicitation of better prompts, we follow [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] in retrieving sentences from
Wikipedia as potential prompts. First, we split the Wikipedia passages into sentences. Then we
check each sentence against all the subject-object pairs in the LM-KBC dataset to identify whether the
sentence contains both entities as exact text matches after lowercasing. For generating prompts
from the selected sentences, we follow the mining-based generation methods introduced by
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which includes elicitation of middle-word prompts and dependency-based prompts. The
middle-word prompts are generated by retaining the words in between the subject-object pair. The
dependency-based prompts are generated based on the dependency tree, where the prompt
borders are selected based on the left-most and right-most words of the shortest dependency path
between the two entities.
        </p>
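        <p>A sketch of the middle-word mining step is given below, assuming the Wikipedia sentences have already been collected; the dependency-based variant and the exact filtering rules are omitted, and the toy sentence is an illustrative assumption.</p>
        <preformat>
# Sketch: mine middle-word prompt templates from sentences that contain both the
# subject and the object of a training pair (lower-cased exact match), and rank
# the templates by frequency. Dependency-based prompts are omitted here.
from collections import Counter

def mine_middle_word_prompts(sentences, pairs, top_n=20):
    counter = Counter()
    for sent in sentences:
        low = sent.lower()
        for subj, obj in pairs:
            s, o = subj.lower(), obj.lower()
            if s in low and o in low:
                i, j = low.index(s), low.index(o)
                if i &lt; j:
                    middle = sent[i + len(subj):j].strip()
                    template = "[SUBJ] " + middle + " [OBJ]"
                else:
                    middle = sent[j + len(obj):i].strip()
                    template = "[OBJ] " + middle + " [SUBJ]"
                if middle:
                    counter[template] += 1
    return counter.most_common(top_n)  # keep the most frequent prompts

sentences = ["Andalusia shares a border with Gibraltar in the south."]
pairs = [("Andalusia", "Gibraltar")]
print(mine_middle_word_prompts(sentences, pairs))
        </preformat>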
        <p>In evaluation, we take the top-20 most frequent prompts in the challenge training set, and
rank the performance of the BERT LM on the training set using each of these retrieved prompts.
Finally, the average over the top-performing prompts is used as the prediction score for each
predicted object entity. We iteratively add more prompts to the average in the order of the
ranking, and take the best-performing combination (on the training set of the challenge) that is
at least 1% higher than the previous best using fewer prompts (the 1% margin is introduced to prevent overfitting).</p>
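        <p>The greedy selection loop described above can be sketched as follows; score_with_prompts, a helper that evaluates training-set F-1 for an averaged set of prompts, is hypothetical and stands in for the actual evaluation code.</p>
        <preformat>
# Sketch: grow the prompt ensemble in ranked order and keep a larger combination
# only if it beats the previous best training-set F-1 by at least 1%.
def select_prompt_ensemble(ranked_prompts, score_with_prompts, margin=0.01):
    best_prompts = ranked_prompts[:1]
    best_f1 = score_with_prompts(best_prompts)
    for k in range(2, len(ranked_prompts) + 1):
        candidate = ranked_prompts[:k]
        f1 = score_with_prompts(candidate)
        # Only grow the ensemble if the gain is at least `margin` (guards against overfitting).
        if f1 &gt;= best_f1 + margin:
            best_prompts, best_f1 = candidate, f1
    return best_prompts, best_f1

# Toy usage: F-1 for ensembles of size 1, 2 and 3 (hypothetical numbers).
prompts = ["[SUBJ] works at [OBJ].", "[SUBJ] is employed by [OBJ].", "[SUBJ] joined [OBJ]."]
toy_f1 = {1: 0.30, 2: 0.32, 3: 0.325}
print(select_prompt_ensemble(prompts, lambda ps: toy_f1[len(ps)]))
        </preformat>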
        <p>
          Results for this experiment are displayed in Table 6. Contrary to results reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], by
using a very similar approach over a similar set of relations, we do not observe the same
scale of improvement with an increasing number of prompts; in fact, most of the time
the best F-1 score is reached with a single prompt template, which is often the manually-written one.
We argue that this difference is due to the difference in evaluation metrics: we care about
F-1 scores rather than macro-average accuracies, which attaches higher importance to the
precision of methods.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompt Decomposition</title>
        <p>One issue we found hindering the performance of the baseline is that, for the relation
StateSharesBorderState, the subjects and objects are in fact not always states. For instance, the subject
“Andalusia” is an autonomous community, and the subject “Hebei” is a province. By calling
these subjects “states”, as the baseline prompt does, BERT gets confused and outputs irrelevant
object entities. To address this issue, we introduce a pre-condition prompt asking which kind of
location the subject is, with the prompt template “[SUBJ], as a place, is a [MASK].” From all the
candidate tokens that BERT generates for the mask, we look for the following set of keywords:
[state, province, department, city, region]. The top-ranked keyword will be taken as the type of
the subject. Then, the formal prompt would go like “[SUBJ] [KEYWORD] shares border with
[MASK] [KEYWORD]”. In our experiments, we observe a positive effect from this amendment:
the F-1 score for the relation “StateSharesBorderState” increases from 0.112 with the baseline prompt to
0.162 just with this one change in prompt formulation.</p>
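        <p>A sketch of this two-step decomposition with the transformers fill-mask pipeline is shown below; the keyword list follows the text above, while the pipeline usage and the fallback keyword are illustrative choices.</p>
        <preformat>
# Sketch: two-step prompting for StateSharesBorderState. First ask BERT what kind
# of place the subject is, then reuse the winning keyword in the border prompt.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-large-cased")
KEYWORDS = ["state", "province", "department", "city", "region"]

def decomposed_border_prompt(subject):
    # Step 1: pre-condition prompt asking which kind of location the subject is.
    pre_prompt = subject + ", as a place, is a [MASK]."
    candidates = fill_mask(pre_prompt, top_k=50)
    keyword = next(
        (c["token_str"] for c in candidates if c["token_str"] in KEYWORDS),
        "state",  # fall back to the baseline wording if no keyword is generated
    )
    # Step 2: the formal prompt uses the predicted subject type on both sides.
    return subject + " " + keyword + " shares border with [MASK] " + keyword

print(decomposed_border_prompt("Andalusia"))
print(decomposed_border_prompt("Hebei"))
        </preformat>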
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Explorations for Candidate Selection</title>
      <p>In this section, we discuss the approaches explored for selecting appropriate candidates from
the distribution of candidate tokens as outputted above. Following the baseline, we consider
only the top-100 candidates for each object-to-predict. One immediate observation from the
baseline is that, among the top predicted tokens, there often exist pronouns, such as me, them, it,
or determiners, such as the, a, some. Thus, we remove these pronouns and determiners as a
post-processing step to clean up the results.</p>
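      <p>A minimal sketch of this post-processing filter is given below; the stoplist is an illustrative assumption that extends the examples in the text.</p>
      <preformat>
# Sketch: drop pronouns and determiners from the candidate tokens.
# The stoplist is illustrative and extends the examples mentioned in the text.
STOPLIST = {"me", "them", "it", "he", "she", "they", "the", "a", "an", "some", "this", "that"}

def clean_candidates(candidates):
    # candidates: list of (token, score) pairs from the MLM prediction head
    return [(tok, score) for tok, score in candidates if tok.lower() not in STOPLIST]

print(clean_candidates([("Spain", 0.41), ("it", 0.12), ("the", 0.05)]))
      </preformat>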
      <p>Another observation from the baseline is that the default threshold of 0.5 is too harsh for
many relations, and could be relaxed to optimize the F-1 scores. To this end, we exhaustively
search the thresholds between 0 and 0.95 by steps of 0.01, and select the best thresholds based
on training set F-1.</p>
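      <p>A sketch of this per-relation threshold search is shown below; f1_at_threshold, which computes training-set F-1 for a given threshold, is a hypothetical helper.</p>
      <preformat>
# Sketch: exhaustively search thresholds in [0, 0.95] with a step of 0.01 and keep
# the one with the best training-set F-1. `f1_at_threshold` is a hypothetical helper.
def best_threshold(f1_at_threshold, lo=0.0, hi=0.95, step=0.01):
    best_t, best_f1 = lo, float("-inf")
    t = lo
    while t &lt;= hi + 1e-9:
        f1 = f1_at_threshold(round(t, 2))
        if f1 &gt; best_f1:
            best_t, best_f1 = round(t, 2), f1
        t += step
    return best_t, best_f1

# Toy usage: pretend training-set F-1 peaks at a threshold of 0.3.
print(best_threshold(lambda t: -(t - 0.3) ** 2))
      </preformat>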
      <p>Apart from the above, there can be different numbers of answer objects for different subject-relation
pairs, and we notice that for those entries with larger numbers of answers, there are generally
more candidate objects with a substantial normalized prediction score. This
contradicts the assumption behind normalized prediction scores: normalized prediction
scores are interpreted as distributions over the tokens, where there is supposed to be only
one true answer. But when there are multiple confidently-predicted candidates, the prediction
score of each of them is diluted. Therefore, by setting a common threshold for all entries, the
answers for those entries with more true answers are disadvantaged, and have a larger chance
of being missed.</p>
      <p>To mitigate this effect, we first tried removing Softmax from the BERT MLM prediction head.
The idea is that, by removing the normalization and exposing the raw scores, all confident predictions
should receive high scores and thus can be thresholded equally. However, experimental results
show that removing the Softmax function causes performance to drop consistently across all relations.
We speculate that this is because the range of raw prediction scores varies from sentence to
sentence; without a normalization operation, the scores themselves are too noisy.</p>
      <p>We further tried keeping the Softmax layer, but additionally introducing sticky thresholds.
That is, we rank the candidate objects by prediction scores and iterate over them; when a
candidate object does not have a high enough prediction score to meet the threshold, but is relatively
close to its previous candidate (for instance, &gt; 80% of the prediction score of its previous candidate),
we accept this candidate as well. We search for optimal sticky ratios along with the thresholds.
Unfortunately, we observe that while for many relations the best F-1 score is reached with
non-empty sticky ratios, only a very slight improvement is achieved, as shown in Table 8.</p>
      <p>[Per-relation values recovered from the extraction residue of the accompanying table: ChemicalCompoundElement: null; CompanyParentOrganization: null; CountryBordersWithCountry: 0.4; CountryOfficialLanguage: 0.91; PersonCauseOfDeath: null; PersonEmployer: 0.76; PersonInstrument: 0.43; PersonLanguage: null; PersonPlaceOfDeath: null; PersonProfession: 0.49; RiverBasinsCountry: 0.85; StateSharesBorderState: 0.64; Average: NA]</p>
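      <p>A sketch of the sticky-threshold selection rule follows; candidate scores are assumed to be Softmax-normalized and sorted in descending order, and stopping at the first rejected candidate is a design choice of this sketch.</p>
      <preformat>
# Sketch: accept a candidate if it clears the threshold, or if it is "sticky",
# i.e. its score is at least sticky_ratio (e.g. 80%) of the previously accepted
# candidate's score.
def select_with_sticky_threshold(candidates, threshold, sticky_ratio=0.8):
    # candidates: list of (token, score), sorted by score in descending order
    selected = []
    prev_score = None
    for token, score in candidates:
        if score &gt;= threshold or (prev_score is not None and score &gt;= sticky_ratio * prev_score):
            selected.append(token)
            prev_score = score
        else:
            break  # scores only decrease from here, so nothing further is accepted
    return selected

# Toy usage with illustrative scores and a threshold of 0.42.
print(select_with_sticky_threshold([("Faro", 0.45), ("Beja", 0.40), ("Gibraltar", 0.10)],
                                   threshold=0.42))
      </preformat>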
    </sec>
    <sec id="sec-5">
      <title>5. Final Results</title>
      <p>Table 9 summarizes the techniques we have tried. It is not surprising that applying an
adaptive threshold scheme brings substantial improvements; on the other hand, it is interesting
that task-specific MLM training brings another pronounced boost in performance. Prompt
decomposition shows a moderate but convincing improvement, whereas the improvement
from adding the sticky ratios is negligible.</p>
      <p>In Table 10 are the final results for our system on the challenge test set, as recorded on the
leaderboard (https://codalab.lisn.upsaclay.fr/competitions/5815). This final set of results is acquired under the following setup: we use the family of
BERT LM checkpoints based on BERT-large-cased, as presented in Table 5 in Section 2; we use
our manually updated set of prompts (for computation speed) as in Section 3.1, with thresholds
as assigned in Section 4, ignoring sticky ratios; and we use the type-assignment decomposition for
“StateSharesBorderState”.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We have explored methods to improve Knowledge Graph population with LMs under the track 1
constraint of using BERT as the language model backbone. In particular, we explored improving
the LM representation, candidate object generation and candidate selection. We have made
significant progress against the baseline method, and have also found remaining issues, which,
if addressed, would bring further gain in performance and/or versatility. We highlight the
following as promising areas of future work: 1) efficient intermediate fine-tuning for arbitrary
relations; 2) automatic prompt decomposition, with more powerful LM backbones; 3) alternative
re-ranking methods for independent judgement of candidate validity.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors thank the challenge organizers for their timely and helpful response to inquiries,
and the reviewers for their valuable comments. This work is supported in part by a Mozilla
Informatics PhD scholarship.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Socher, D. Chen, C. D. Manning, A. Ng, Reasoning With Neural Tensor Networks for Knowledge Base Completion, in: Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://papers.nips.cc/paper/2013/hash/b337e84de8752b27eda3a12363109e80-Abstract.html.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating Embeddings for Modeling Multi-relational Data, in: Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://papers.nips.cc/paper/2013/hash/1cecc7a77928ca8133fa24680a88d2f9-Abstract.html.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78-85. URL: https://dl.acm.org/doi/10.1145/2629489.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] D. Alivanistos, S. B. Santamaría, M. Cochez, J.-C. Kalo, E. van Krieken, T. Thanapalasingam, Prompting as Probing: Using Language Models for Knowledge Base Construction, 2022. URL: http://arxiv.org/abs/2208.11057. arXiv:2208.11057 [cs].</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8342-8360. URL: https://aclanthology.org/2020.acl-main.740. doi:10.18653/v1/2020.acl-main.740.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How Can We Know What Language Models Know?, Transactions of the Association for Computational Linguistics 8 (2020) 423-438. URL: https://aclanthology.org/2020.tacl-1.28. doi:10.1162/tacl_a_00324. Place: Cambridge, MA. Publisher: MIT Press.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>