<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Task-specific Pre-training and Prompt Decomposition for Knowledge Graph Population with Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jef Z. Pan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianyi Li</string-name>
          <email>tianyi.li@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenyu Huang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Papasarantopoulos</string-name>
          <email>nikos.papasarantopoulos@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavlos Vougiouklis</string-name>
          <email>pavlos.vougiouklis@huawei.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Huawei Edinburgh Research Centre</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ILCC, School of Informatics, University of Edinburgh</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a system for knowledge graph population with Language Models, evaluated on the Knowledge Base Construction from Pre-trained Language Models (LM-KBC) challenge at ISWC 2022. Our system involves task-specific pre-training to improve LM representation of the masked object tokens, prompt decomposition for progressive generation of candidate objects, among other methods for higher-quality retrieval. Our system is the winner of track 1 of the LM-KBC challenge, based on the BERT LM; it achieves a 55.0% F-1 score on the hidden test set of the challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Knowledge graph population is the task of predicting the objects for given subject-relation pairs. For example, for the subject-relation pair &lt;  , StateSharesBorderState &gt;, the task is to predict the appropriate objects such as Faro, Beja, Gibraltar, etc. The task of knowledge graph population is closely related to the task of link prediction in the knowledge graph and Natural Language Processing (NLP) literature [1, 2]; the key difference is that, in knowledge graph population, the objects are generated not from a fixed pool of entity nodes, but from an open vocabulary of words.</p>
      <p>Our code and data are available at https://github.com/Teddy-Li/LMKBC-Track1.</p>
      <p>
        Our system falls in track 1 of the challenge: it seeks to improve the BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] language
model’s performance in knowledge graph population from the following three dimensions: 1)
LM representation of masked object tokens; 2) candidate object generation; 3) candidate object
selection (ranking). For improving LM representations, we apply task-specific pre-training,
utilizing silver data retrieved from Wikidata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to aid the training process; for candidate
generation, we use prompt decomposition to convert complex knowledge graph population tasks
into multiple simpler tasks; for candidate selection, we use adaptive thresholds, together with
explorations of methods for relaxing the single-true-object assumption behind Softmax normalization.
      </p>
      <p>
        In comparison to the winning submission in track 2 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we highlight the following
contributions: 1) we show the effectiveness of task-specific pre-training, particularly when doing
so separately for each individual relation; 2) we propose to decompose prompts to split the
task into multiple steps, in order to achieve the best results under the constraints of LM size and
capability.
      </p>
      <p>Below, we discuss the above three dimensions of improvement in detail in Sections 2, 3 and 4,
then describe our main experimental results in Section 5.</p>
      <p>Our method is based on the BERT-large-cased LM (https://huggingface.co/bert-large-cased), since as a general observation we have found
cased BERT models to outperform uncased ones; we speculate that this can be attributed to
an explicit distinction between named entities and general nouns through the capitalization of the first
characters. For all supervised experiments, we split the train set further into a train2 and a dev2
set with respective portions of 80% and 20%. We use the train2 set for training and the dev2
set for checkpointing; this way, the dev set is kept as a hidden evaluation dataset, on which we
report results throughout Sections 2 to 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task-Specific Pre-training for Better Representations</title>
      <p>
        Language models like BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have been trained on diverse texts at a large scale; therefore,
by “reminding” the models of what type of information they are supposed to recall from
pre-training, their performance is expected to improve. Along this line, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have shown that adaptive
pre-training is helpful for language models’ performance on target domains. In our system,
we explore the approach of training the BERT language model under the MLM objective with the
subject-relation-object triples.
      </p>
      <p>The task-specific pre-training approach can be summarized as follows: given a
subject-relation-object triple, we use the triple to instantiate the corresponding prompt template, to
create a sentence. We then mask those tokens in the sentence that are relevant to the object
entity, and train the BERT models with the masked sentence, where the objective is to recover
these masked tokens.</p>
      <p>One interesting dimension of freedom in this task is which tokens to mask. This is motivated
by our end task, to predict the tokens in the place of objects. Therefore, the representation
of those object tokens is what we are most keen on improving. We further hypothesize that
improving the representation of tokens close to the object tokens (for instance, the tokens “a”
and “.” in the sentence “A cat sits on a mat .”) may also help with the prediction of object tokens.
Thus, in summary, we mask the tokens corresponding to the object,
as well as the tokens beside the object tokens, up to a chosen window size on each side.</p>
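      <p>Below is a minimal sketch of how such masked training examples might be built with the HuggingFace transformers tokenizer; the template string, the example triple, and the window-size default are illustrative assumptions rather than the exact implementation.</p>
      <preformat>
# Sketch: build an MLM training example that masks the object tokens plus a
# window of neighbouring tokens (template, triple and window size are assumptions).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

def build_masked_example(subject, obj, template, window=1):
    # Instantiate the prompt template with the subject and object of the triple.
    sentence = template.replace("[SUBJ]", subject).replace("[OBJ]", obj)
    enc = tokenizer(sentence, return_offsets_mapping=True)
    input_ids = enc["input_ids"]

    # Find the token positions covered by the object string.
    obj_start = sentence.index(obj)
    obj_end = obj_start + len(obj)
    obj_positions = [
        i for i, (s, e) in enumerate(enc["offset_mapping"])
        if s &lt; obj_end and e &gt; obj_start and e &gt; s  # (0, 0) offsets are special tokens
    ]

    # Extend the masked span by `window` tokens on each side, keeping [CLS]/[SEP] unmasked.
    lo = max(1, min(obj_positions) - window)
    hi = min(len(input_ids) - 2, max(obj_positions) + window)

    # Standard MLM convention: labels are -100 everywhere except the masked positions.
    labels = [-100] * len(input_ids)
    for i in range(lo, hi + 1):
        labels[i] = input_ids[i]
        input_ids[i] = tokenizer.mask_token_id
    return {"input_ids": input_ids, "labels": labels}

example = build_masked_example("Portugal", "Spain", "[SUBJ] shares border with [OBJ].")
print(tokenizer.decode(example["input_ids"]))
      </preformat>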
      <p>
        Another interesting dimension of freedom is what training data to use. As a baseline we
have the training set from the LM-KBC challenge to use for generating sentences; however,
the scale of the training set (i.e. 100 subjects per relation) is very small even for a fine-tuning
dataset. To mitigate this data sparsity issue, we further refer to Wikidata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more data
entries, collecting the set of subject-object pairs recorded in Wikidata as satisfying each relation.
Notably, to maintain the integrity of the evaluation, we exclude any overlap between the
subjects in any subset of the challenge and the subjects in the retrieved entity pairs. We exclude
entries by subject because: 1) it is more secure to exclude subject mentions with arbitrary objects
than to exclude subject-object pairs; 2) for the challenge test set only the subjects are available, so by excluding
overlaps by subjects, we ensure our models do not peek at the test set in any way.
      </p>
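      <p>A sketch of how such silver subject-object pairs might be retrieved is shown below; it queries the public Wikidata SPARQL endpoint and drops any subject that appears in the challenge data. The property ID (P47, “shares border with”), the result limit, and the toy subject list are illustrative assumptions.</p>
      <preformat>
# Sketch: collect silver subject-object pairs for one relation from the Wikidata
# SPARQL endpoint, excluding subjects that occur anywhere in the challenge data.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# P47 ("shares border with") is used here as an illustrative relation mapping.
QUERY = """
SELECT ?subjLabel ?objLabel WHERE {
  ?subj wdt:P47 ?obj .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10000
"""

def fetch_silver_pairs(challenge_subjects):
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "lmkbc-silver-data-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    pairs = []
    for row in resp.json()["results"]["bindings"]:
        subj = row["subjLabel"]["value"]
        obj = row["objLabel"]["value"]
        # Exclude by subject: any subject seen in a challenge split is dropped.
        if subj not in challenge_subjects:
            pairs.append((subj, obj))
    return pairs

# Toy usage with two subjects that must not leak into the pre-training data.
silver = fetch_silver_pairs(challenge_subjects={"Andalusia", "Hebei"})
print(len(silver), silver[:3])
      </preformat>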
      <p>As illustrated in Tables 1 and 2, we find mixed results from our initial experimentation with
task-specific pre-training: the performance dramatically improves for some relations, and
dramatically drops for others. The trend is consistent across different configurations of
intermediate pre-training, while the exact values differ slightly across different window sizes, with no
dominant configuration. This implies that, for knowledge graph population with BERT language
models, one size does not fit all; we need separately fine-tuned LM checkpoints for different
relations to achieve the best results. Adapting BERT to attend to different relations separately would
be impractical with only the challenge training set because of the small size of the training data;
however, with the much larger silver datasets retrieved from Wikidata, we are able to
elicit a family of BERT checkpoints, each dedicated to one or a few relations, where different
checkpoints are reminded of different types of factual knowledge. When jointly used for link
prediction, our family of BERT checkpoints exhibits superior performance over any single BERT
model, as shown in Table 5.</p>
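      <p>A sketch of how such a family of relation-specific checkpoints might be used jointly at prediction time follows; the checkpoint paths and the relation-to-checkpoint mapping are hypothetical, with the stock BERT-large-cased model as a fallback.</p>
      <preformat>
# Sketch: dispatch each query to the checkpoint pre-trained for its relation,
# falling back to the generic model. Checkpoint paths are hypothetical.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

CHECKPOINTS = {
    "StateSharesBorderState": "checkpoints/state_borders",        # hypothetical path
    "ChemicalCompoundElement": "checkpoints/chemical_elements",   # hypothetical path
    # ... one entry per relation (or group of relations)
}
DEFAULT = "bert-large-cased"

tokenizer = BertTokenizerFast.from_pretrained(DEFAULT)
_models = {}

def model_for(relation):
    # Lazily load and cache the checkpoint dedicated to this relation.
    path = CHECKPOINTS.get(relation, DEFAULT)
    if path not in _models:
        _models[path] = BertForMaskedLM.from_pretrained(path).eval()
    return _models[path]

def predict_object_tokens(subject, relation, template, top_k=100):
    model = model_for(relation)
    prompt = template.replace("[SUBJ]", subject).replace("[OBJ]", tokenizer.mask_token)
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    probs = logits[0, mask_pos[0]].softmax(dim=-1)
    scores, ids = probs.topk(top_k)
    return list(zip(tokenizer.convert_ids_to_tokens(ids.tolist()), scores.tolist()))

# Falls back to the generic checkpoint for relations without a dedicated model.
print(predict_object_tokens("Portugal", "CountryBordersWithCountry",
                            "[SUBJ] shares border with [OBJ].")[:5])
      </preformat>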
      <p>As an additional comment, we present the sizes of our additional MLM training data in Table
3, where we show that there is not a clear dependency between the sizes of the training sets
and the performance of task-specifically pre-trained checkpoints. This means that the discrepancy
in performance of this MLM training is not strongly related to data size, but rather to
the properties of the knowledge required for each relation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Prompt Decomposition for Improved Candidate Generation</title>
      <p>In this section, we discuss the approaches explored for generating better candidate objects by
prompt-based link prediction. Our efforts here can be broadly classified into two categories:
using better prompts and decomposing the prompts.</p>
      <sec id="sec-3-1">
        <title>3.1. Prompt Elicitation</title>
        <p>On the elicitation of better prompts, we experimented with both manual and automatic
approaches. For the relation PersonInstrument, in order to help BERT ground the names to the
corresponding musicians, we explicitly provide the entity type “musician” as part of the prompt:
“The musician [SUBJ] plays [OBJ], which is an instrument”. For the relation PersonEmployer,
we simplify the prompt into a concise sentence to the same effect: “[SUBJ] works at [OBJ]”.</p>
        <p>
          For automatic elicitation of better prompts, we follow [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] in retrieving sentences from
Wikipedia as potential prompts. First, we split the Wikipedia passages into sentences. Then we
check each sentence against all the subject-object pairs in the LM-KBC dataset to identify whether the
sentence contains both entities as exact text matches after lowercasing. For generating prompts
from the selected sentences, we follow the mining-based generation methods introduced by
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which includes elicitation of middle-word prompts and dependency-based prompts. The
middle-word prompts are generated by retaining the words in between the subject-object pair. The
dependency-based prompts are generated based on the dependency tree, where the prompt
borders are selected based on the left-most and right-most words of the shortest dependency path
between the two entities.
        </p>
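        <p>A sketch of the middle-word mining step is given below, assuming the Wikipedia sentences have already been collected; the dependency-based variant and the exact filtering rules are omitted, and the toy sentence is an illustrative assumption.</p>
        <preformat>
# Sketch: mine middle-word prompt templates from sentences that contain both the
# subject and the object of a training pair (lower-cased exact match), and rank
# the templates by frequency. Dependency-based prompts are omitted here.
from collections import Counter

def mine_middle_word_prompts(sentences, pairs, top_n=20):
    counter = Counter()
    for sent in sentences:
        low = sent.lower()
        for subj, obj in pairs:
            s, o = subj.lower(), obj.lower()
            if s in low and o in low:
                i, j = low.index(s), low.index(o)
                if i &lt; j:
                    middle = sent[i + len(subj):j].strip()
                    template = "[SUBJ] " + middle + " [OBJ]"
                else:
                    middle = sent[j + len(obj):i].strip()
                    template = "[OBJ] " + middle + " [SUBJ]"
                if middle:
                    counter[template] += 1
    return counter.most_common(top_n)  # keep the most frequent prompts

sentences = ["Andalusia shares a border with Gibraltar in the south."]
pairs = [("Andalusia", "Gibraltar")]
print(mine_middle_word_prompts(sentences, pairs))
        </preformat>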
        <p>In evaluation, we take the top-20 most frequent prompts in the challenge training set, and
rank the performance of the BERT LM on the training set using each of these retrieved prompts.
Finally, the average over the top-performing prompts is used as the prediction score for each
predicted object entity. We iteratively add more prompts to the average in the order of the
ranking, and take the best-performing combination (on the training set of the challenge) that is
at least 1% higher than the previous best using fewer prompts (the 1% margin is introduced to prevent overfitting).</p>
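        <p>The greedy selection loop described above can be sketched as follows; score_with_prompts, a helper that evaluates training-set F-1 for an averaged set of prompts, is hypothetical and stands in for the actual evaluation code.</p>
        <preformat>
# Sketch: grow the prompt ensemble in ranked order and keep a larger combination
# only if it beats the previous best training-set F-1 by at least 1%.
def select_prompt_ensemble(ranked_prompts, score_with_prompts, margin=0.01):
    best_prompts = ranked_prompts[:1]
    best_f1 = score_with_prompts(best_prompts)
    for k in range(2, len(ranked_prompts) + 1):
        candidate = ranked_prompts[:k]
        f1 = score_with_prompts(candidate)
        # Only grow the ensemble if the gain is at least `margin` (guards against overfitting).
        if f1 &gt;= best_f1 + margin:
            best_prompts, best_f1 = candidate, f1
    return best_prompts, best_f1

# Toy usage: F-1 for ensembles of size 1, 2 and 3 (hypothetical numbers).
prompts = ["[SUBJ] works at [OBJ].", "[SUBJ] is employed by [OBJ].", "[SUBJ] joined [OBJ]."]
toy_f1 = {1: 0.30, 2: 0.32, 3: 0.325}
print(select_prompt_ensemble(prompts, lambda ps: toy_f1[len(ps)]))
        </preformat>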
        <p>
          Results for this experiment are displayed in Table 6. Contrary to results reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], by
using a very similar approach over a similar set of relations, we do not observe the same
scale of improvement with an increasing number of prompts; in fact, most of the time
the best F-1 score is reached with a single prompt template, which is often the manually-written one.
We argue that this difference is due to the difference in evaluation metrics: we care about
F-1 scores rather than macro-average accuracies, which attaches higher importance to the
precision of methods.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompt Decomposition</title>
        <p>One issue we found hindering the performance of the baseline is that, for the relation
StateSharesBorderState, the subjects and objects are in fact not always states. For instance, the subject
“Andalusia” is an autonomous community, and the subject “Hebei” is a province. By calling
these subjects “states”, as the baseline prompt does, BERT gets confused and outputs irrelevant
object entities. To address this issue, we introduce a pre-condition prompt asking which kind of
location the subject is, with the prompt template “[SUBJ], as a place, is a [MASK].” From all the
candidate tokens that BERT generates for the mask, we look for the following set of keywords:
[state, province, department, city, region]. The top-ranked keyword will be taken as the type of
the subject. Then, the formal prompt would go like “[SUBJ] [KEYWORD] shares border with
[MASK] [KEYWORD]”. In our experiments, we observe a positive effect from this amendment:
the F-1 score for the relation “StateSharesBorderState” increases from 0.112 with the baseline prompt to
0.162 just with this one change in prompt formulation.</p>
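        <p>A sketch of this two-step decomposition with the transformers fill-mask pipeline is shown below; the keyword list follows the text above, while the pipeline usage and the fallback keyword are illustrative choices.</p>
        <preformat>
# Sketch: two-step prompting for StateSharesBorderState. First ask BERT what kind
# of place the subject is, then reuse the winning keyword in the border prompt.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-large-cased")
KEYWORDS = ["state", "province", "department", "city", "region"]

def decomposed_border_prompt(subject):
    # Step 1: pre-condition prompt asking which kind of location the subject is.
    pre_prompt = subject + ", as a place, is a [MASK]."
    candidates = fill_mask(pre_prompt, top_k=50)
    keyword = next(
        (c["token_str"] for c in candidates if c["token_str"] in KEYWORDS),
        "state",  # fall back to the baseline wording if no keyword is generated
    )
    # Step 2: the formal prompt uses the predicted subject type on both sides.
    return subject + " " + keyword + " shares border with [MASK] " + keyword

print(decomposed_border_prompt("Andalusia"))
print(decomposed_border_prompt("Hebei"))
        </preformat>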
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Explorations for Candidate Selection</title>
      <p>In this section, we discuss the approaches explored for selecting appropriate candidates from
the distribution of candidate tokens as outputted above. Following the baseline, we consider
only the top-100 candidates for each object-to-predict. One immediate observation from the
baseline is that, among the top predicted tokens, there often exist pronouns, such as me, them, it,
or determiners, such as the, a, some. Thus, we remove these pronouns and determiners as a
post-processing step to clean up the results.</p>
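      <p>A minimal sketch of this post-processing filter is given below; the stoplist is an illustrative assumption that extends the examples in the text.</p>
      <preformat>
# Sketch: drop pronouns and determiners from the candidate tokens.
# The stoplist is illustrative and extends the examples mentioned in the text.
STOPLIST = {"me", "them", "it", "he", "she", "they", "the", "a", "an", "some", "this", "that"}

def clean_candidates(candidates):
    # candidates: list of (token, score) pairs from the MLM prediction head
    return [(tok, score) for tok, score in candidates if tok.lower() not in STOPLIST]

print(clean_candidates([("Spain", 0.41), ("it", 0.12), ("the", 0.05)]))
      </preformat>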
      <p>Another observation from the baseline is that the default threshold of 0.5 is too harsh for
many relations, and could be relaxed to optimize the F-1 scores. To this end, we exhaustively
search the thresholds between 0 and 0.95 by steps of 0.01, and select the best thresholds based
on training set F-1.</p>
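      <p>A sketch of this per-relation threshold search is shown below; f1_at_threshold, which computes training-set F-1 for a given threshold, is a hypothetical helper.</p>
      <preformat>
# Sketch: exhaustively search thresholds in [0, 0.95] with a step of 0.01 and keep
# the one with the best training-set F-1. `f1_at_threshold` is a hypothetical helper.
def best_threshold(f1_at_threshold, lo=0.0, hi=0.95, step=0.01):
    best_t, best_f1 = lo, float("-inf")
    t = lo
    while t &lt;= hi + 1e-9:
        f1 = f1_at_threshold(round(t, 2))
        if f1 &gt; best_f1:
            best_t, best_f1 = round(t, 2), f1
        t += step
    return best_t, best_f1

# Toy usage: pretend training-set F-1 peaks at a threshold of 0.3.
print(best_threshold(lambda t: -(t - 0.3) ** 2))
      </preformat>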
      <p>Apart from the above, there can be different numbers of answer objects for different subject-relation
pairs, and we notice that for those entries with larger numbers of answers, there are generally
more candidate objects with a substantial normalized prediction score. This
contradicts the assumption behind normalized prediction scores: normalized prediction
scores are interpreted as distributions over the tokens, where there is supposed to be only
one true answer. But when there are multiple confidently-predicted candidates, the prediction
score of each of them is diluted. Therefore, by setting a common threshold for all entries, the
answers for those entries with more true answers are disadvantaged, and have a larger chance
of being missed.</p>
      <p>To mitigate this effect, we first tried removing Softmax from the BERT MLM prediction head.
The idea is that, by removing the normalization and exposing the raw scores, all confident predictions
should receive high scores and thus can be thresholded equally. However, experimental results
show that removing the Softmax function causes performance to drop consistently across all relations.
We speculate that this is because the range of raw prediction scores varies from sentence to
sentence; without a normalization operation, the scores themselves are too noisy.</p>
      <p>We further tried keeping the Softmax layer, but additionally introducing sticky thresholds.
That is, we rank the candidate objects by prediction scores and iterate over them; when a
candidate object does not have a high enough prediction score to meet the threshold, but is relatively
close to its previous candidate (for instance, &gt; 80% of the prediction score of its previous candidate),
we accept this candidate as well. We search for optimal sticky ratios along with the thresholds.
Unfortunately, we observe that while for many relations the best F-1 score is reached with
non-empty sticky ratios, only a very slight improvement is achieved, as shown in Table 8.</p>
      <p>[Per-relation values recovered from the extraction residue of the accompanying table: ChemicalCompoundElement: null; CompanyParentOrganization: null; CountryBordersWithCountry: 0.4; CountryOfficialLanguage: 0.91; PersonCauseOfDeath: null; PersonEmployer: 0.76; PersonInstrument: 0.43; PersonLanguage: null; PersonPlaceOfDeath: null; PersonProfession: 0.49; RiverBasinsCountry: 0.85; StateSharesBorderState: 0.64; Average: NA]</p>
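      <p>A sketch of the sticky-threshold selection rule follows; candidate scores are assumed to be Softmax-normalized and sorted in descending order, and stopping at the first rejected candidate is a design choice of this sketch.</p>
      <preformat>
# Sketch: accept a candidate if it clears the threshold, or if it is "sticky",
# i.e. its score is at least sticky_ratio (e.g. 80%) of the previously accepted
# candidate's score.
def select_with_sticky_threshold(candidates, threshold, sticky_ratio=0.8):
    # candidates: list of (token, score), sorted by score in descending order
    selected = []
    prev_score = None
    for token, score in candidates:
        if score &gt;= threshold or (prev_score is not None and score &gt;= sticky_ratio * prev_score):
            selected.append(token)
            prev_score = score
        else:
            break  # scores only decrease from here, so nothing further is accepted
    return selected

# Toy usage with illustrative scores and a threshold of 0.42.
print(select_with_sticky_threshold([("Faro", 0.45), ("Beja", 0.40), ("Gibraltar", 0.10)],
                                   threshold=0.42))
      </preformat>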
    </sec>
    <sec id="sec-5">
      <title>5. Final Results</title>
      <p>Table 9 summarizes the techniques we have tried. It is not surprising that applying an
adaptive threshold scheme brings substantial improvements; on the other hand, it is interesting
that task-specific MLM training brings another pronounced boost in performance. Prompt
decomposition shows a moderate but convincing improvement, whereas the improvement
from adding the sticky ratios is negligible.</p>
      <p>In Table 10 are the final results for our system on the challenge test set, as recorded on the
leaderboard (https://codalab.lisn.upsaclay.fr/competitions/5815). This final set of results is acquired under the following setup: we use the family of
BERT LM checkpoints based on BERT-large-cased, as presented in Table 5 in Section 2; we use
our manually updated set of prompts (for computation speed) as in Section 3.1, with thresholds
as assigned in Section 4, ignoring sticky ratios; and we use the type-assignment decomposition for
“StateSharesBorderState”.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We have explored methods to improve Knowledge Graph population with LMs under the track 1
constraint of using BERT as the language model backbone. In particular, we explored improving
the LM representation, candidate object generation and candidate selection. We have made
significant progress against the baseline method, and have also found remaining issues, which,
if addressed, would bring further gain in performance and/or versatility. We highlight the
following as promising areas of future work: 1) efficient intermediate fine-tuning for arbitrary
relations; 2) automatic prompt decomposition, with more powerful LM backbones; 3) alternative
re-ranking methods for independent judgement of candidate validity.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors thank the challenge organizers for their timely and helpful response to inquiries,
and the reviewers for their valuable comments. This work is supported in part by a Mozilla
Informatics PhD scholarship.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Socher, D. Chen, C. D. Manning, A. Ng, Reasoning With Neural Tensor Networks for Knowledge Base Completion, in: Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://papers.nips.cc/paper/2013/hash/b337e84de8752b27eda3a12363109e80-Abstract.html.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating Embeddings for Modeling Multi-relational Data, in: Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://papers.nips.cc/paper/2013/hash/1cecc7a77928ca8133fa24680a88d2f9-Abstract.html.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78-85. URL: https://dl.acm.org/doi/10.1145/2629489.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] D. Alivanistos, S. B. Santamaría, M. Cochez, J.-C. Kalo, E. van Krieken, T. Thanapalasingam, Prompting as Probing: Using Language Models for Knowledge Base Construction, 2022. URL: http://arxiv.org/abs/2208.11057. arXiv:2208.11057 [cs].</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8342-8360. URL: https://aclanthology.org/2020.acl-main.740. doi:10.18653/v1/2020.acl-main.740.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How Can We Know What Language Models Know?, Transactions of the Association for Computational Linguistics 8 (2020) 423-438. URL: https://aclanthology.org/2020.tacl-1.28. doi:10.1162/tacl_a_00324. Place: Cambridge, MA. Publisher: MIT Press.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>