=Paper=
{{Paper
|id=Vol-3894/dl4kg_paper1
|storemode=property
|title=Relation Extraction for Constructing Knowledge Graphs:
Enhancing the Searchability of Community-Generated
Digital Content (CGDC) Collections
|pdfUrl=https://ceur-ws.org/Vol-3894/dl4kg_paper1.pdf
|volume=Vol-3894
|authors=Martin Marinov,Youcef Benkhedda,Ewan Hannaford,Marc Alexander,Goran Nenadic,Riza Batista-Navarro
}}
==Relation Extraction for Constructing Knowledge Graphs: Enhancing the Searchability of Community-Generated Digital Content (CGDC) Collections==
Martin Marinov1, Youcef Benkhedda1, Ewan Hannaford2, Marc Alexander2, Goran Nenadic1 and Riza Batista-Navarro1,*
1 Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK
2 School of Critical Studies, University of Glasgow, Glasgow, G11 6EW, UK
Abstract
Much of people’s understanding of their cultural heritage is facilitated by the curation and preservation of
community-generated digital content (CGDC): archival collections that were created for, with and by local
communities. However, communities employ their own conventions in storing and publishing their content.
Given this and the fact that semantic information tends to be buried within textual descriptions, CGDC archives are
currently siloed and obscured, thus making it difficult for end-users (e.g., members of the public and researchers)
to search for fine-grained information (e.g., “Where did Alfred Edward Julian work?”). In this paper, we propose
to represent the information within CGDC archives in the form of knowledge graphs. To enable the construction
of such knowledge graphs at scale, we developed a zero-shot approach for relation extraction, which we cast as a
natural language inference (NLI) problem. Specifically, for each of the 20 relation types drawn from Wikidata
that we have identified as relevant to CGDC, we created a premise-hypothesis pair that is presented to an NLI
model that determines whether entailment (and thus the relation type) holds. The premise is a sentence from the
natural language description and the hypothesis is automatically generated using a template based on each of the
relation types. We present the results of comparing and combining three different transformer-based models that
were already fine-tuned for the NLI task, namely, DeBERTa, BART and T5.
Keywords
Relation Extraction, Zero-shot Prompting, Transformer Models, Knowledge Graphs, Cultural Heritage
1. Introduction
Cultural heritage is preserved and passed on to future generations through archival collections and their
digitalisation. While enormous efforts (e.g., The National Archives of the UK, Europeana) have been
put into making such collections available to the public, a significant part of people’s cultural heritage
is represented only in community-generated digital content (CGDC): digital-born archive collections
developed for, with and by communities [1]. For example, in the UK, around 5000 community history
projects have been funded by the National Lottery Heritage Fund (NLHF), allowing many communities
to explore and preserve their history and heritage, leading to the proliferation of CGDC collections.
These communities tend to follow their own conventions to store and represent their CGDC collections.
As a result, most of the rich semantic information contained within these collections remains
buried within textual metadata, such as the titles and descriptions of CGDC items. The description
of a photograph of a local landmark, for instance, might mention the person or organisation who
was responsible for building that landmark—information that is potentially of interest to researchers
or members of the public and yet obscured within text. In this work, we propose to transform the
information within the textual metadata of CGDC collections into a knowledge graph, in order to make
CGDC more searchable and queryable. To this end, we investigated zero-shot approaches to relation
extraction (RE) as a means for automatically curating such knowledge graphs, thus eliminating the
need for any training data. Specifically, we cast RE as a natural language inference (NLI) task, allowing
us to take advantage of three pre-trained transformer-based language models: DeBERTa [2], BART [3]
and T5 [4]. Although RE has been addressed using zero-shot approaches in other domains [5], to the
best of our knowledge, such approaches remain under-explored in the cultural heritage domain. Cetoli
[6] and Chen and Li [7] proposed zero-shot RE approaches for the general domain. Meanwhile, Tang
et al. [8] investigated various pre-trained transformer-based language models for RE in ancient Chinese
history documents; however, their models required some form of training (either by fine-tuning or
chain-of-thought prompting). Most similar to our work is that of Tan et al. [9], who also constructed
knowledge graphs based on entity relations extracted from Chinese cultural heritage texts. However,
their work is rule-based and did not explore the use of transformer-based models.
DL4KG’24: Deep Learning and Large Language Models for Knowledge Graphs, August 26, 2024, Barcelona, Spain
* Corresponding author.
Email: m.marinov0101@gmail.com (M. Marinov); youcef.benkhedda@manchester.ac.uk (Y. Benkhedda);
ewan.hannaford@glasgow.ac.uk (E. Hannaford); marc.alexander@glasgow.ac.uk (M. Alexander);
gnenadic@manchester.ac.uk (G. Nenadic); riza.batista@manchester.ac.uk (R. Batista-Navarro)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Relation-Annotated Dataset
To support our experimentation with transformer-based RE models, we constructed a dataset of CGDC
textual metadata where relationships between named entities were manually annotated.
2.1. Relation Types
The work of Benkhedda et al. [10] considered the following five entity types as relevant to CGDC:
Person, Organisation, Location, Miscellaneous and Date. Drawing inspiration from their
work, we decided to focus on relation types involving any of these entity types. For this purpose,
instead of creating our own relation types, we leveraged Wikidata properties [11]. Initially, a total of
29 Wikidata properties (relation types) were identified as relevant to CGDC. Our annotation process
(described in Section 2.2), however, revealed that nine of these types were very rarely encountered in
our dataset. Thus, in the end, after keeping only the types with at least three labelled examples, 20
relation types were retained. We refer the reader to Appendix A for a full listing of these relation types.
2.2. Data Collection and Annotation
In our work, we utilised the CGDC dataset constructed by Benkhedda et al. [10], which is annotated
with named entities (and their corresponding identifiers in Wikidata). The subset of their dataset
that was made publicly available consists of 100 documents, where each document corresponds to the
concatenation of the title and description of a CGDC item. Half of the dataset (50 documents) came
from the Morrab Library Photographic Archive, which comprises thousands of digitised photographs that
capture Cornish culture and history.1 The other 50 documents were drawn from People’s Collection
Wales (PCW),2 an online platform for gathering various media (e.g., photographs, documents, audio
and video recordings) from individuals and community groups who wish to contribute items related to
Welsh culture and history.
We then designed a simple relation annotation scheme, where we defined a relation as consisting
of a head and tail entity, which are linked according to a relation type. For each relation
type, our scheme specifies the possible entity types for each of the head and tail entities. For example,
for the notable_work relation type, there are three possible head-tail entity combinations:
Person-Miscellaneous (applies when a person created a miscellaneous entity such as a work of art),
Organisation-Miscellaneous (if the creator is an organisation) and Person-Location (applies
when a person is known for building or designing a location such as a church). Appendix A provides
all the valid head-tail entity type combinations for each CGDC relation type.
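The entity type constraints described above can be expressed as a simple lookup. The following is a minimal sketch, assuming a dictionary keyed by relation type; the names `ALLOWED_TYPES` and `is_valid_relation` are hypothetical and are not taken from our codebase, and only three of the 20 relation types from Appendix A are shown.

```python
# Illustrative subset of the head/tail entity type constraints from Appendix A.
# ALLOWED_TYPES and is_valid_relation are hypothetical names used for this sketch only.
ALLOWED_TYPES = {
    "notable_work": {
        ("Person", "Miscellaneous"),
        ("Organisation", "Miscellaneous"),
        ("Person", "Location"),
    },
    "spouse": {("Person", "Person")},
    "date_of_birth": {("Person", "Date")},
}


def is_valid_relation(relation_type, head_type, tail_type):
    """Return True if the (head, tail) entity types are permitted for the relation type."""
    return (head_type, tail_type) in ALLOWED_TYPES.get(relation_type, set())


# A person who designed a church is a valid notable_work relation;
# the reverse direction is not.
print(is_valid_relation("notable_work", "Person", "Location"))
print(is_valid_relation("notable_work", "Location", "Person"))
```

Because the pairs are ordered, the lookup also encodes the direction of each relation (head first, tail second).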
The brat tool [12] was configured to enable the manual annotation of our relation types of interest,
with the entity type constraints for each relation type specified. Two annotators (the second and
last authors of this paper) independently labelled entity relations in all of the 100 documents. The
inter-annotator agreement (IAA) between the two annotators was determined to be 0.48 in terms of
Cohen’s Kappa [13], which falls within the range that Landis and Koch [14] label as moderate
agreement, a reasonable level given that there are 20 possible classes. The annotated set of 100
documents was then expanded by labelling the named
entities and relations in 137 further documents drawn from the PCW collection. Similar to the first
set of documents, two annotators (the third and last authors of this paper) independently annotated
the entity relations in all of the 137 documents and a similar level of IAA (i.e., 0.49 in terms of Cohen’s
Kappa) was obtained. In our experiments for evaluating the performance of transformer-based RE
models, we utilise all the 237 relation-labelled CGDC documents.
1 https://photoarchive.morrablibrary.org.uk/
2 https://www.peoplescollection.wales/
3. Methods
The RE task can be defined as follows: given a sentence 𝑠 and a pair of entities 𝑒1 and 𝑒2 contained
within that sentence, as well as a set of classes 𝐶, an RE model should identify one class 𝑐 ∈ 𝐶 that
best describes the relationship between 𝑒1 and 𝑒2 based on 𝑠. In this work, we are concerned with 20
relation types that are relevant to CGDC. However, apart from these, we also include an additional
class that we refer to as None Of The Above or NOTA, to which pairs of entities that are not related
according to any of our 20 CGDC types belong. Thus, there are a total of 21 classes in 𝐶.
3.1. Models
A number of transformer-based models that were already fine-tuned for the NLI task were employed to
extract relations between entities in a zero-shot manner. Each of these models is described below.
DeBERTa (which stands for Decoding-Enhanced BERT with disentangled Attention) improves upon
the encoder-only BERT model by employing a disentangled attention mechanism [15]. In this work, we
employed version 3 of the DeBERTa model [2] that was fine-tuned on NLI datasets.3 This particular
model can be employed in a zero-shot classification manner, whereby the model classifies an input
sequence according to a given set of classes (labels), also providing a probability for each class.
BART is an encoder-decoder model that combines a BERT-like encoder and an auto-regressive
decoder similar to GPT. It has demonstrated satisfactory performance on both text generation and
comprehension tasks (such as text classification) [3]. We employed a version of the BART model that
was fine-tuned for NLI.4 Similar to our chosen DeBERTa model, this model can be employed in a
zero-shot classification manner and provides probabilities together with predicted labels.
T5 stands for Text-to-Text Transfer Transformer [4], which casts many downstream NLP tasks (including
NLI) as a sequence-to-sequence modelling problem, following an encoder-decoder architecture.
An extra-extra-large (XXL) version of T5 that was fine-tuned for the NLI task was employed in our
experiments [16].5 Unlike the DeBERTa and BART models, this model does not provide any probability
values together with its predicted labels.
Importantly, we investigated an ensemble model, henceforth referred to as Ensemble, that is a
combination of the above three models. The prediction of this ensemble model was determined by
taking the majority vote amongst the constituent transformer-based models.
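Majority voting over the three models can be sketched as follows. This is an illustration and not our actual implementation: in particular, falling back to NOTA when all three models disagree is an assumption made for this sketch, and the function name is hypothetical.

```python
from collections import Counter

NOTA = "NOTA"


def majority_vote(predictions, fallback=NOTA):
    """Return the label predicted by more than half of the models.

    Assumption for this sketch: if no label reaches a majority
    (e.g. three models give three different labels), return `fallback`.
    """
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes > len(predictions) / 2 else fallback


# Order of predictions here: BART, DeBERTa, T5.
print(majority_vote(["spouse", "NOTA", "spouse"]))
print(majority_vote(["spouse", "NOTA", "employer"]))
```

With three constituent models, a label needs at least two votes to be selected.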
3.2. Experimental Setup
As mentioned above, we cast RE as an NLI task, whereby two sentences are provided to a model as
input: a premise and a hypothesis. If based on the premise, the hypothesis is true, then the model should
detect that an entailment relation holds between the two sentences. Otherwise, there is no entailment
relation between the two sentences. To frame RE as an NLI task, an input sentence 𝑠 is considered to be
the premise, while a hypothesis is automatically generated by populating a sentence template—that is
predefined for a particular relation type—with input entities 𝑒1 and 𝑒2. The premise-hypothesis pair is
then presented to an NLI model. If the model detects entailment, then we say that the relation type
(corresponding to the template) holds between the input entities. In Table 1, we provide examples of
premise-hypothesis pairs, in which the hypothesis was generated based on a template. The complete
set of templates is provided as part of our codebase.6
Table 1
Examples of premise-hypothesis pairs used as inputs to the NLI models.
{| class="wikitable"
! Relation Type !! Template !! Premise (Input Sentence with Entities Underlined) !! Template-generated Hypothesis
|-
| inception || was created on || Operating from <u>1923</u>, the <u>Lady Magdalen</u> was just one of the ferries that would carry passengers and cars across the Cleddau River. || Lady Magdalen was created on 1923.
|-
| operating_area || operated in || Operating from 1923, the <u>Lady Magdalen</u> was just one of the ferries that would carry passengers and cars across the <u>Cleddau River</u>. || Lady Magdalen operated in Cleddau River.
|}
3 https://huggingface.co/cross-encoder/nli-deberta-v3-large
4 https://huggingface.co/facebook/bart-large-mnli
5 https://huggingface.co/google/t5_xxl_true_nli_mixture
In preparation for applying the NLI models in a zero-shot manner, we firstly segment each document
(in the evaluation dataset) into individual sentences. For each of these input sentences, all possible pairs
of entities (contained within a sentence) are created. For every entity pair, a hypothesis is generated for
each relation type. Each of these generated hypotheses is then paired up with the premise (the input
sentence). Finally, our NLI models take every premise-hypothesis pair as an input sample that is then
classified as being characterised by entailment or not. If entailment is not detected (with a probability
of at least 0.40 in the case of the BART and DeBERTa models) for any of the relation types, we assign
the NOTA label to the input sample. In cases where entailment is detected by the model for more than
one class (relation type), we simply take the class with the highest probability value, as provided by the
BART and DeBERTa models. As the T5 model does not output any probability values, we implemented
post-processing rules that specify which relation types should take precedence in case of ties.
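The procedure described above, for models that return probabilities, can be sketched as follows. This is a minimal illustration under stated assumptions and not our released code: the `{head}`/`{tail}` placeholder syntax, the function names and the use of ordered entity pairs are assumptions, and the NLI model is replaced by a stand-in scoring function so that the example runs without downloading a model. Only the two templates from Table 1 are included, and entity type constraints are not applied.

```python
from itertools import permutations

# Two templates from Table 1; the placeholder syntax is an assumption of this sketch.
TEMPLATES = {
    "inception": "{head} was created on {tail}.",
    "operating_area": "{head} operated in {tail}.",
}
THRESHOLD = 0.40  # entailment probability cut-off used for BART and DeBERTa
NOTA = "NOTA"


def build_hypotheses(head, tail, templates=TEMPLATES):
    """Generate one hypothesis per relation type for an entity pair."""
    return {rel: t.format(head=head, tail=tail) for rel, t in templates.items()}


def classify_pair(premise, head, tail, entailment_prob, threshold=THRESHOLD):
    """Pick the relation with the highest entailment probability, or NOTA."""
    hypotheses = build_hypotheses(head, tail)
    scores = {rel: entailment_prob(premise, hyp) for rel, hyp in hypotheses.items()}
    best_rel = max(scores, key=scores.get)
    return best_rel if scores[best_rel] >= threshold else NOTA


def classify_sentence(premise, entities, entailment_prob):
    """Classify every ordered (head, tail) pair of entities in a sentence."""
    return {
        (head, tail): classify_pair(premise, head, tail, entailment_prob)
        for head, tail in permutations(entities, 2)
    }


def toy_entailment_prob(premise, hypothesis):
    """Stand-in for an NLI model: returns a made-up entailment probability."""
    entailed = {
        "Lady Magdalen was created on 1923.",
        "Lady Magdalen operated in Cleddau River.",
    }
    return 0.9 if hypothesis in entailed else 0.05


premise = (
    "Operating from 1923, the Lady Magdalen was just one of the ferries that "
    "would carry passengers and cars across the Cleddau River."
)
labels = classify_sentence(
    premise, ["Lady Magdalen", "1923", "Cleddau River"], toy_entailment_prob
)
for pair, label in labels.items():
    print(pair, "->", label)
```

In practice, `toy_entailment_prob` would be replaced by a call to one of the NLI models that returns the probability of the entailment class for a premise-hypothesis pair.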
4. Evaluation and Error Analysis
In this section, we report the results of applying our chosen NLI models to the RE task in the zero-shot
manner described in the preceding section. As previously mentioned, entity pairs were exhaustively
created based on our evaluation data (the 237 CGDC documents) to form the input samples. This
resulted in 5938 entity pairs that the NLI models need to classify according to relation type (or the lack
thereof, in which case the NOTA class applies). Out of those, 5435 (or 92%) belong to the NOTA class.
In Table 2, we report the performance obtained by each of our models in terms of the standard metrics
of precision, recall and F1-score. To avoid skewing the results towards the over-represented NOTA class,
we report the performance separately for the 20 CGDC relation types, and the NOTA class. Weighted
macro-averaging was employed in reporting combined performance for the 20 CGDC relation types.
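The averaging can be illustrated with a short sketch. It assumes that "weighted macro-averaging" means averaging the per-class scores weighted by the number of gold-standard examples of each class (its support); the function names and the toy labels below are hypothetical.

```python
from collections import Counter


def per_class_prf(gold, pred, label):
    """Precision, recall and F1-score for a single class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_average_prf(gold, pred, labels):
    """Support-weighted average of per-class precision, recall and F1-score."""
    support = Counter(g for g in gold if g in labels)
    total = sum(support.values())
    if total == 0:
        return 0.0, 0.0, 0.0
    sums = [0.0, 0.0, 0.0]
    for label in labels:
        for i, score in enumerate(per_class_prf(gold, pred, label)):
            sums[i] += score * support[label]
    return tuple(s / total for s in sums)


# Toy example: NOTA is excluded from the averaged classes, as in Table 2.
gold = ["spouse", "spouse", "employer", "NOTA"]
pred = ["spouse", "NOTA", "employer", "spouse"]
print(weighted_average_prf(gold, pred, ["spouse", "employer"]))
```

Because each F1-score is computed per class before averaging, the averaged F1-score need not equal the harmonic mean of the averaged precision and recall.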
With respect to the 20 CGDC relation types, the Ensemble model obtained the best weighted macro-
averaged F1-score of 0.487. Considering the IAA of 0.48-0.49 that we obtained (see Section 2.2), the
model comes close to upper-bound performance. Meanwhile, the models obtained F1-scores as high as
0.914 (for T5) on the NOTA class. This is particularly impressive considering that it is well-known that
handling NOTA cases is a challenging task [17].
Table 2
Performance of the RE models in terms of weighted macro-averaged scores. For the CGDC types, the
average was taken over all 20 relation types while for the None Of The Above (NOTA) label, the
scores simply correspond to the performance on this one label.
{| class="wikitable"
! Classes !! Model !! Precision !! Recall !! F1-score
|-
| CGDC Relation Types || BART || 0.293 || 0.817 || 0.403
|-
| CGDC Relation Types || DeBERTa || 0.420 || 0.718 || 0.454
|-
| CGDC Relation Types || T5 || 0.474 || 0.620 || 0.472
|-
| CGDC Relation Types || Ensemble || 0.431 || 0.779 || 0.487
|-
| NOTA || BART || 0.993 || 0.483 || 0.650
|-
| NOTA || DeBERTa || 0.990 || 0.638 || 0.776
|-
| NOTA || T5 || 0.979 || 0.857 || 0.914
|-
| NOTA || Ensemble || 0.991 || 0.665 || 0.796
|}
We manually examined some cases where the models made incorrect predictions. For instance, in the
sentence “She was born on 10th March 1833 at Market Rasen, Lincolnshire, the eldest daughter of Henry
Albert Browne and Frances Margaret Nicholson”, no relation (NOTA) holds between the entities “Frances
Margaret Nicholson” and “Henry Albert Browne”, according to the gold standard annotations. Both T5
and BART predicted the spouse label for this entity pair (whereas DeBERTa correctly predicted
NOTA). One can, however, argue that even the predictions by T5 and BART are not necessarily incorrect.
The interpretation of such results is rather subjective: a prediction can be considered either correct or
incorrect depending on people’s opinion or the intended use of the models. Overall, it was observed that the
models tended to detect or infer implicit relations that human annotators might otherwise miss.
6 Our code and annotations (for the 237 CGDC documents in our dataset) are available at https://github.com/OurHeritageOurStories/cgdc_re.
5. Knowledge Graph Curation
Taking the predictions of the Ensemble model over the 237 documents in our dataset, we populated a
knowledge graph (KG) whereby vertices represent named entities and edges represent any relations
detected between them. Here, only entities that were manually assigned Wikidata IDs (as part of gold
standard entity linking annotations) were included, to ensure that entities in the resulting KG are
normalised.
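The population step described above can be sketched in plain Python before loading the result into a graph database. This is an illustration only: the function name is hypothetical, and the identifiers in the example are placeholders, not real Wikidata IDs.

```python
def build_graph(predicted_triples, wikidata_ids, nota_label="NOTA"):
    """Build vertices and edges from predicted (head, relation, tail) triples.

    Only entities with a Wikidata ID are kept, so that vertices are normalised.
    `wikidata_ids` maps an entity mention to its Wikidata ID.
    """
    vertices = {}   # Wikidata ID -> entity label
    edges = set()   # (head ID, relation type, tail ID)
    for head, relation, tail in predicted_triples:
        if relation == nota_label:
            continue  # no relation detected for this pair
        if head not in wikidata_ids or tail not in wikidata_ids:
            continue  # skip entities that were not linked to Wikidata
        head_id, tail_id = wikidata_ids[head], wikidata_ids[tail]
        vertices.setdefault(head_id, head)
        vertices.setdefault(tail_id, tail)
        edges.add((head_id, relation, tail_id))
    return vertices, edges


# Placeholder IDs for illustration; these are not real Wikidata identifiers.
ids = {"Lady Magdalen": "Q_PLACEHOLDER_1", "Cleddau River": "Q_PLACEHOLDER_2"}
triples = [
    ("Lady Magdalen", "operating_area", "Cleddau River"),
    ("Lady Magdalen", "inception", "1923"),       # "1923" has no ID here, so it is skipped
    ("Cleddau River", "NOTA", "Lady Magdalen"),   # NOTA predictions add no edge
]
vertices, edges = build_graph(triples, ids)
print(vertices)
print(edges)
```

Each resulting edge can then be written to the graph database as a relationship between two nodes identified by their Wikidata IDs.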
To create the knowledge graph, we utilised the Neo4j framework,7 a graph database management
tool. Using the Cypher query language, we were able to query the resulting knowledge graph. Figure 1
shows a visualisation of the results of an example query for a use case where the relationships of a
particular person of interest (King Edward) with other entities have been retrieved.
7 https://neo4j.com/
Figure 1: Results obtained by querying the populated knowledge graph for information on entities related to
King Edward.
6. Conclusions
In this paper, we demonstrate how a zero-shot approach based on NLI can be employed to extract entity
relations in CGDC textual metadata. Our findings show that, based on evaluation using a dataset of
237 CGDC documents, an ensemble of three different transformer-based models (BART, DeBERTa and
T5) obtains the best weighted macro-averaged F1-score for 20 CGDC relation types. The knowledge
graph that was constructed based on automatically extracted relations provides a means for searching
for information that is otherwise buried within CGDC textual metadata. Our future work will focus
on expanding the CGDC dataset to include more relation-annotated documents that will allow for
more robust evaluation, including experimentation and comparison with closed-source large language
models.
References
[1] L. Konstantelos, L. Hughes, W. Kilbride, The Bits Liveth Forever? Digital Preservation and the First World War Commemoration, Technical Report, IWM War and Conflict Subject Network, 2019.
[2] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, 2021. arXiv:2111.09543.
[3] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, in: Proceedings of ACL 2020, ACL, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
[4] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[5] R. Gabud, P. Lapitan, V. Mariano, E. Mendoza, N. Pampolina, M. A. A. Clariño, R. Batista-Navarro, Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species, Frontiers in Artificial Intelligence 7 (2024). URL: https://www.frontiersin.org/articles/10.3389/frai.2024.1371411. doi:10.3389/frai.2024.1371411.
[6] A. Cetoli, Exploring the zero-shot limit of FewRel, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of COLING 2020, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 1447–1451. URL: https://aclanthology.org/2020.coling-main.124. doi:10.18653/v1/2020.coling-main.124.
[7] C.-Y. Chen, C.-T. Li, ZS-BERT: Towards Zero-Shot Relation Extraction with Attribute Representation Learning, in: Proceedings of NAACL 2021, Association for Computational Linguistics, Online, 2021, pp. 3470–3479. URL: https://aclanthology.org/2021.naacl-main.272. doi:10.18653/v1/2021.naacl-main.272.
[8] X. Tang, Q. Su, J. Wang, Z. Deng, CHisIEC: An Information Extraction Corpus for Ancient Chinese History, in: Proceedings of LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 3192–3202. URL: https://aclanthology.org/2024.lrec-main.283.
[9] Y. Tan, H. Wang, Z. Zhao, T. Fan, A Joint Entity-Relation Detection and Generalization Method Based on Syntax and Semantics for Chinese Intangible Cultural Heritage Texts, Journal on Computing and Cultural Heritage 17 (2024). URL: https://doi.org/10.1145/3631124. doi:10.1145/3631124.
[10] Y. Benkhedda, A. Skapars, V. Schlegel, G. Nenadic, R. Batista-Navarro, Enriching the metadata of community-generated digital content through entity linking: An evaluative comparison of state-of-the-art models, in: Proceedings of LaTeCH-CLfL 2024, ACL, St. Julians, Malta, 2024, pp. 213–220. URL: https://aclanthology.org/2024.latechclfl-1.20.
[11] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.
[12] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, BRAT: a web-based tool for NLP-assisted text annotation, in: Proceedings of Demonstrations at EACL 2012, 2012, pp. 102–107.
[13] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.
[14] J. R. Landis, G. G. Koch, The Measurement of Observer Agreement for Categorical Data, Biometrics 33 (1977) 159–174. URL: https://doi.org/10.2307/2529310.
[15] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-Enhanced BERT with Disentangled Attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[16] O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, Y. Matias, TRUE: Re-evaluating Factual Consistency Evaluation, in: Proceedings of NAACL 2022, ACL, Seattle, United States, 2022, pp. 3905–3920. URL: https://aclanthology.org/2022.naacl-main.287.
[17] O. Sabo, Y. Elazar, Y. Goldberg, I. Dagan, Revisiting few-shot relation classification: Evaluation data and classification schemes, Transactions of the Association for Computational Linguistics 9 (2021) 691–706.
A. Details of CGDC Relations
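The relation types relevant to CGDC, their equivalent human-readable labels, their Wikidata identifiers
(linked to a page with their definitions and synonyms) and the types of head and tail entities involved.
{| class="wikitable"
! Wikidata Property/Relation Type !! Human-readable Label !! Wikidata ID !! Head Entity Type !! Tail Entity Type
|-
| affiliation || connected to || P1416 || Person || Organisation
|-
| affiliation || connected to || P1416 || Organisation || Organisation
|-
| date_of_birth || born on || P569 || Person || Date
|-
| date_of_death || died on || P570 || Person || Date
|-
| depicts || shows || P180 || Miscellaneous || Person
|-
| depicts || shows || P180 || Miscellaneous || Location
|-
| depicts || shows || P180 || Miscellaneous || Organisation
|-
| depicts || shows || P180 || Miscellaneous || Miscellaneous
|-
| employer || worked for || P108 || Person || Person
|-
| employer || worked for || P108 || Person || Organisation
|-
| inception || began to exist on || P571 || Organisation || Date
|-
| inception || began to exist on || P571 || Location || Date
|-
| inception || began to exist on || P571 || Miscellaneous || Date
|-
| location || happened in || P276 || Miscellaneous || Location
|-
| located_in || located in || P706 || Location || Location
|-
| member_of || member of || P463 || Person || Organisation
|-
| notable_work || created or built || P800 || Person || Miscellaneous
|-
| notable_work || created or built || P800 || Organisation || Miscellaneous
|-
| notable_work || created or built || P800 || Person || Location
|-
| occupation || worked as || P106 || Person || Miscellaneous
|-
| operating_area || operated in || P2541 || Miscellaneous || Location
|-
| participant_in || participated in || P1344 || Person || Miscellaneous
|-
| participant_in || participated in || P1344 || Organisation || Miscellaneous
|-
| partnership_with || collaborated with || P2652 || Organisation || Organisation
|-
| place_of_birth || born in || P19 || Person || Location
|-
| point_in_time || happened on or as of || P585 || Miscellaneous || Date
|-
| point_in_time || happened on or as of || P585 || Location || Date
|-
| residence || lived in || P551 || Person || Location
|-
| significant_person || wrote to, spoke to or met || P3342 || Person || Person
|-
| spouse || married to || P26 || Person || Person
|-
| work_location || worked in || P937 || Person || Location
|-
| work_location || worked in || P937 || Organisation || Location
|}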