Draw Me Like My Triples: Leveraging Generative AI for Wikidata Image Completion

Raia Abu Ahmad1, Martin Critelli2, Şefika Efeoğlu3,4, Eleonora Mancini5, Célian Ringwald6, Xinyue Zhang7 and Albert Meroño-Peñuela8
1 Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH), Berlin, Germany
2 University Ca’Foscari of Venice, Italy
3 Freie Universität Berlin, Germany
4 Technische Universität Berlin, Germany
5 University of Bologna, Italy
6 Université Côte d’Azur, Inria, CNRS, I3S, France
7 University of Oxford, Oxford, United Kingdom
8 King’s College London, London, United Kingdom

Abstract
Humans are critical for the creation and maintenance of high-quality Knowledge Graphs (KGs). However, creating and maintaining large KGs only with humans does not scale, especially for contributions based on multimedia (e.g. images) that are hard to find and reuse on the Web and expensive for humans to create from scratch. Therefore, we leverage generative AI for the task of creating images for Wikidata items that do not have them. Our approach uses the knowledge contained in the Wikidata triples of items describing fictional characters, together with a T5 model fine-tuned on the WDV dataset, to generate natural text descriptions of fictional-character items with missing images. We use these natural text descriptions as prompts for a transformer-based text-to-image model, Stable Diffusion v2.1, to generate plausible candidate images for Wikidata image completion. We design and implement quantitative and qualitative approaches to evaluate the plausibility of our methods, including a survey to assess the quality of the generated images.

Keywords
Generative AI, Image Generation, Automated Prompt Generation

Wikidata’23: Wikidata workshop at ISWC 2023
raia.abu_ahmad@dfki.de (R. Abu Ahmad); martin.critelli@unive.it (M. Critelli); sefika.efeoglu@fu-berlin.de,tu-berlin.de (Ş. Efeoğlu); e.mancini@unibo.it (E. Mancini); celian.ringwald@inria.fr (C. Ringwald); xinyue.zhang@cs.ox.ac.uk (X. Zhang); albert.merono@kcl.ac.uk (A. Meroño-Peñuela)
ORCID: 0009-0004-8720-0116 (R. Abu Ahmad); 0000-0002-8177-730X (M. Critelli); 0000-0002-9232-4840 (Ş. Efeoğlu); 0000-0001-9205-3289 (E. Mancini); 0000-0002-9232-4840 (C. Ringwald); 0000-0002-9232-4840 (X. Zhang); 0000-0003-4646-5842 (A. Meroño-Peñuela)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Large knowledge bases (KBs) such as Wikidata are maintained by human editors in a collaborative manner in order to provide structured data of high quality [1]. However, given the size of this platform, there is an evident problem of incompleteness that creates several content gaps [2]. We note that this is especially true for contributions based on multimedia (such as images, audio, and video), since it is difficult for editors to find such high-quality contributions on the Web, and even more difficult and expensive to create them from scratch. In this work, we examine the problem of missing images for a specific class of Wikidata entities: fictional characters. We motivate this choice by the fact that querying Wikidata shows that only 7% of the 83.7K instances of the fictional character class, including its sub-classes, have an image 1.
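For reference, a coverage count of this kind can be obtained from the Wikidata SPARQL endpoint with a query along the lines of the sketch below (Python with the SPARQLWrapper package). It is an illustrative formulation, not necessarily the exact query behind footnote 1, and over ~80K items such a query may need to be split to avoid endpoint timeouts.

```python
# Sketch: estimate how many fictional characters (Q95074, incl. subclasses) have an image (P18).
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?total) (COUNT(DISTINCT ?depicted) AS ?withImage) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q95074 .          # instance of (a subclass of) fictional character
  OPTIONAL { ?item wdt:P18 ?img . BIND(?item AS ?depicted) }
}
"""

sparql = SPARQLWrapper(ENDPOINT, agent="wikidata-image-completion-sketch/0.1")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
row = sparql.query().convert()["results"]["bindings"][0]
total, with_image = int(row["total"]["value"]), int(row["withImage"]["value"])
print(f"{with_image}/{total} fictional characters have an image ({100 * with_image / total:.1f}%)")
```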
It is important to note that this class was specifically chosen due to ethical and privacy concerns: automatically generating images to portray entities of other classes (e.g. person) could have detrimental consequences. Although alternative methods for finding images for this class of entities exist (e.g. fan-created images), they are unreliable in terms of objectively representing the characters and demand thorough manual research by editors to make sure that they correctly align with Wikidata’s knowledge about each entity. We propose a novel method that leverages knowledge from the Wikidata triples of each fictional character entity in order to create a representative image for it using generative artificial intelligence (AI) models. This is done by (1) extracting triples from Wikidata entities, (2) creating English prompts to be fed into a generative text-to-image model such as Stable Diffusion [3], and (3) generating a representative image of the character that could potentially be used on Wikidata. We investigate the effectiveness of this approach by generating four different types of prompts in English for each character, including prompts based on triple verbalisation with large language models (LLMs), and comparing the resulting images. We evaluate our approach on a ground-truth dataset that consists of fictional characters which already have an image on Wikidata. We select different metrics of automatic image comparison to measure how similar each generated image is to the ground-truth one. Additionally, since automatic measures for image comparison are limited, we conduct a human evaluation survey in which we ask participants to evaluate image similarity. Our work addresses the following research questions (RQs):
• RQ1: To what extent can different types of prompts based on triples be used in text-to-image models to produce high-quality images?
• RQ2: To what extent can the output of generative AI be used for Wikidata image completion?
• RQ3: How can generative text-to-image models be evaluated?
To the best of our knowledge, no previous study has explored using Wikidata as a source for creating prompts for generative text-to-image models. Our work 2 offers the following contributions:
• A framework that generates prompts for a text-to-image model (Stable Diffusion v2.1 3) with different levels of structure and natural language text based on Wikidata triples.
• A dataset of generated images for fictional characters extracted from Wikidata that can potentially be used by editors for image completion.
• An evaluation strategy showing evidence of the relevancy and adequacy of using AI-generated images for our use case.

1 This query was performed in June 2023.
2 The project and dataset are available at https://github.com/helemanc/gryffindor and at https://huggingface.co/gryffindor-ISWS, respectively.
3 The model card is at https://huggingface.co/stabilityai/stable-diffusion-2-1

2. Related Work

Generative AI: Current groundbreaking advances in AI enable machines to generate novel and original content based on textual prompts. Such generative applications include text-to-text [4], text-to-image [5, 6], and even text-to-music [7]. Generally, these models can capture complex patterns from the input text and produce coherent outputs. A recent survey [8] shows that text-to-image applications have been emerging since 2015, when AlignDRAW [9] pioneered the field by leveraging recurrent neural networks (RNNs) to encode textual captions and produce corresponding images.
Since then, end-to-end models have leveraged architectures such as deep convolutional generative adversarial networks (GANs) [10, 11, 12], autoregressive methods [13, 14, 15], latent space models [16, 17, 18], and the current state-of-the-art diffusion-based methods [19, 20, 21].
Prompt Engineering: Because of the aforementioned advances, the novel area of prompt engineering has emerged, in which humans interact with AI in an iterative process to produce the best prompt (i.e. textual input) for a specific desired output [22]. Recent work has shed light on prompt engineering for AI art specifically [23], concluding that simple and intuitive prompts written by humans are not enough to get the desired results. Rather, writing good prompts is a learned skill that is enhanced by the usage of specific prompt templates 4 and modifiers [24].
Automatic Prompt Generation: When it comes to automatic prompt generation, previous studies tend to investigate using LLMs to construct prompts with techniques such as text mining, text paraphrasing, and data augmentation [25]. However, to the best of our knowledge, no work has touched upon using large KBs such as Wikidata for prompt engineering and generation.

4 https://sweet-hall-e72.notion.site/A-Traveler-s-Guide-to-the-Latent-Space-85efba7e5e6a40e5bd3cae980f30235f

3. Proposed Approach

We conduct our study on instances of the fictional character class (item ID Q95074). The initial stage of our approach involves the extraction of relevant triples pertaining to a specific entity. Subsequently, these triples are used to form various types of prompts in English, functioning as inputs to Stable Diffusion [21], a text-to-image AI model. We generate different types of prompts related to the triples, including a verbalised triples prompt which uses a T5 language model fine-tuned on the WDV dataset [26]. This verbalisation model converts triples into fluent language [27]. The ultimate goal is to generate suitable images that can serve as accurate visual identifiers for their corresponding Wikidata entities. This pipeline is shown in Fig. 1.

Figure 1: The pipeline of our proposed KG completion process.

3.1. Triple Extraction

Generating an image for a specific character requires a description that can be gathered from its related triples. In Wikidata, these can be obtained through SPARQL queries 5, yielding all triples with the character as the subject. Moreover, since properties and entities might be represented by item IDs and property IDs, we also extract and translate each triple from these IDs to their corresponding labels when the triple has an entity or property ID.

5 All the SPARQL queries with detailed explanations are available at https://github.com/helemanc/gryffindor/blob/main/src/data-collection/wiki_query_service.py.
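As a concrete illustration of this extraction step, the following sketch (written for this paper, not taken from the repository's wiki_query_service.py) retrieves the labelled property-value pairs of a single item from the public Wikidata SPARQL endpoint, using Harry Potter (Q3244512) from our evaluation subset as an example.

```python
# Sketch: fetch all direct (property, value) pairs of one Wikidata item, mapped to English labels.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

def extract_triples(qid: str) -> list[tuple[str, str, str]]:
    query = f"""
    SELECT ?itemLabel ?propLabel ?valueLabel WHERE {{
      BIND(wd:{qid} AS ?item)
      ?item ?p ?value .
      ?prop wikibase:directClaim ?p .               # keep only direct ("wdt:") statements
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    sparql = SPARQLWrapper(ENDPOINT, agent="wikidata-image-completion-sketch/0.1")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["itemLabel"]["value"], r["propLabel"]["value"], r["valueLabel"]["value"]) for r in rows]

if __name__ == "__main__":
    for s, p, o in extract_triples("Q3244512"):    # Harry Potter (Q3244512)
        print(s, "|", p, "|", o)
```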
3.2. Prompt Generation

In order to investigate whether prompts closer to natural language lead to better generated images, we generate four distinct prompts for each character. The first three are created from the set of triples extracted for the respective entity from Wikidata, while the last prompt uses English DBpedia abstracts. These prompts are defined as follows (see Table 15 for prompt samples of an entity):
1. Basic Label: This prompt merely employs the “label” that Wikidata assigns to its entities.
2. Plain Triples: This prompt is derived by concatenating the subject, predicate, and object of a triple to form a single sentence, utilising all available triples linked to a specific entity. Notably, sentences generated from plain triples may lack proper structure and grammar.
3. Verbalised Triples: Triple verbalisation is defined as the transformation of structured data (i.e., triples) into human-readable formats (i.e., text). These prompts serve as a summarised paragraph of all input triples.
4. DBpedia Abstracts: We use DBpedia abstracts as prompts, obtained by querying the English chapter of DBpedia [28]. Originally written by human editors on Wikipedia, these abstracts are automatically extracted by DBpedia, preprocessed, and shortened. Unlike the previous prompt types, this is the only one originally written by a human in natural language.
When examining the triples for a single entity, we observe that triples sharing the same predicate tend to contain redundant information. As a result, prompts generated directly from these plain or verbalised triples would repetitively state the same facts. However, the “instance of” predicate seems to provide distinct information for each triple. To avoid duplicating facts in prompt types (2) and (3), we remove duplicate predicates, except for “instance of”, from the input triples. Among triples that share the same predicate, we keep only the one with the longest object, since longer objects likely contain more detailed information than shorter objects with the same predicate.
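As an illustration of this step, the sketch below deduplicates a set of labelled triples and assembles a plain triples prompt. The helper names and the triple format are ours, not the project code, and the verbalisation step is only indicated, since it relies on the T5 model fine-tuned on WDV [26].

```python
# Sketch: build a "plain triples" prompt from labelled (subject, predicate, object) triples,
# applying the deduplication rule described above. Names are illustrative.
from collections import defaultdict

def deduplicate(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        by_predicate[p].append((s, p, o))
    kept = []
    for p, group in by_predicate.items():
        if p == "instance of":                 # every "instance of" triple carries distinct information
            kept.extend(group)
        else:                                  # otherwise keep only the triple with the longest object
            kept.append(max(group, key=lambda t: len(t[2])))
    return kept

def plain_triples_prompt(triples: list[tuple[str, str, str]]) -> str:
    # Concatenate each deduplicated triple into one (possibly ungrammatical) sentence.
    return " ".join(f"{s} {p} {o}." for s, p, o in deduplicate(triples))

# The verbalised-triples prompt would instead pass the deduplicated triples through the
# T5 model fine-tuned on WDV, e.g. via Hugging Face transformers (model name omitted here).
```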
3.3. Image Generation

To ensure reproducibility in image generation, we utilise Stable Diffusion version 2.1 6, an open-source text-to-image model developed by Stability AI that is limited to English-language input. We chose version 2.1 because it supports input shapes up to 1024x1024 and performs better according to benchmark evaluation results [29]. It is important to note that this particular model has inherent limitations when it comes to generating images of the human body. To address this issue and enhance its image generation capabilities, we employ the negative prompts that have been suggested and shared on a public GitHub repository 7. By incorporating these negative prompts, we aim to mitigate malformations in images (e.g. crossed eyes, more than five fingers, etc.). Moreover, since Stable Diffusion limits the number of tokens allowed in the prompt, we embed the prompt using the encoder and tokenizer from Stable Diffusion via the Compel library 8. We run the model on the positive prompts of 1500 fictional characters without existing images on Wikidata and 1500 with images on Wikidata, the latter being used to build a ground-truth dataset for evaluation.

6 Stable Diffusion v2.1 model card: https://huggingface.co/stabilityai/stable-diffusion-2-1
7 The negative prompts: https://github.com/mikhail-bot/stable-diffusion-negative-prompts
8 Compel encodes (long) prompts with the model’s own tokenizer and text encoder; available at https://github.com/damian0815/compel
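The generation step can be reproduced roughly as follows with the diffusers and compel libraries. This is a minimal sketch rather than our exact pipeline: the positive prompt is an abridged example, and the negative prompt is a shortened stand-in for the full list from the repository in footnote 7.

```python
# Sketch: generate one image from a (possibly long) positive prompt plus a fixed negative prompt.
import torch
from diffusers import StableDiffusionPipeline
from compel import Compel

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Compel re-encodes prompts with the pipeline's own tokenizer/text encoder,
# which lets us pass prompts longer than the default token limit as embeddings.
compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder,
                truncate_long_prompts=False)

positive_prompt = "Harry Potter is a fictional character appearing in a series of fantasy novels ..."
negative_prompt = "disfigured, malformed limbs, extra fingers, crossed eyes, blurry"  # abridged stand-in

cond = compel(positive_prompt)
neg_cond = compel(negative_prompt)
cond, neg_cond = compel.pad_conditioning_tensors_to_same_length([cond, neg_cond])

image = pipe(prompt_embeds=cond, negative_prompt_embeds=neg_cond).images[0]
image.save("Q3244512_verbalised_triples.png")
```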
4. Collected Dataset

We construct an extensive dataset 9 comprising 1500 fictional characters with images, as well as 1500 fictional characters without images, randomly chosen from the entire set of fictional characters on Wikidata. Our motivation for collecting data on fictional characters rather than real people lies in our commitment to upholding ethical standards and safeguarding privacy. Also, there is no available dataset about fictional characters, and our data collection source code 10 can easily be applied to different domains by changing parameter settings. In addition, we extend our data by fetching the Wikipedia abstracts of each fictional character from the English chapter of DBpedia. A majority of these fictional characters lack information in DBpedia, since DBpedia is constructed from English Wikipedia; restricting ourselves to English abstracts is not an additional problem in our case, however, because the Stable Diffusion model can only take English text as input. Since most of the fictional characters on Wikidata (ca. 78% 11) do not have any English Wikipedia page, we only managed to gather DBpedia abstracts for 925 fictional characters with images on Wikidata and 341 fictional characters without images (see Table 1).
Analysing the basic statistics in Table 1, we directly notice a large descriptive gap between the two datasets we constructed, in terms of triples, the number of extracted unique relations, and the length of the prompts. Moreover, we notice that the prompt length is usually shortest for verbalised triples and longest for plain triples. After gathering the data about the fictional characters from Wikidata and DBpedia, the four different prompts are automatically constructed using the approach described in Section 3. One example is shown in Figure 2, which depicts the ground-truth image for the character Harlequin together with its four generated images. Based on this example, it is instantly clear that some of the prompts can produce images more similar to the ground truth.

9 The entire dataset is available at https://huggingface.co/gryffindor-ISWS
10 The data collection code is available at https://github.com/helemanc/gryffindor/blob/main/src/data-collection/wiki_query_service.py
11 The percentage was computed on Sept. 6, 2023.

Table 1
Statistical information about the datasets used in the evaluation of the approach.
                                           Characters with Images   Characters without Images
# entities                                               1500                    1500
# of DBpedia abstracts                                     925                     341
# of Wikidata triples                                   35 281                  23 157
Average # of relations per entity                           19                      15
Average # of unique relations per entity                     9                       6
Mean token length of DBpedia abstracts                     213                     175
Mean token length of plain triples                         321                     199
Mean token length of verbalised triples                     89                      68

Figure 2: Images for the character of Harlequin. (a) Ground truth from Wikidata. (b) Generation from the basic label prompt. (c) Generation from the plain triples prompt. (d) Generation from the verbalised triples prompt. (e) Generation from the DBpedia abstract prompt.

5. Evaluation

In order to understand whether the generated images can plausibly be used to represent fictional characters based on their Wikidata triples, we employ two evaluation strategies. The first is an automatic evaluation of image similarity using different metrics, while the second is a human evaluation survey. Since the task of identifying whether two images portray the same character is subjective and difficult, we consider both qualitative and quantitative evaluation approaches. This helps us better understand the effect of the prompt type on the quality of the generated images. In this section, we first describe the evaluation framework used, explaining the different metrics we took into account. Then, we present the obtained results.

5.1. Evaluation Framework

5.1.1. Automatic evaluation

We utilise automated evaluation methods based on three image comparison metrics:
• UQI [30]: computes a pixel-based similarity score by comparing generated images with their corresponding ground-truth images. Notably, since the majority of the original images are in grayscale, the similarity computation also takes their grayscale versions into account. UQI evaluates “image quality based on factors such as loss of correlation, luminance distortion, and contrast distortion” [30].
• CLIPscore [31]: leverages embeddings produced by a contrastive language-image pre-trained model [5]. It is used for measuring image-caption compatibility by comparing image and text embeddings using cosine similarity. CLIP embeddings can be used for image-to-image comparisons as well, which we do by using the image encoder of the CLIP Vision Transformer model (ViT-L/14) [32]; a sketch of this comparison is given below.
• FID (Fréchet Inception Distance) [33]: an improvement over the Inception Score (IS), proposed to measure the quality of images produced by generative models.
Since the computation of the FID metric is more time-consuming than that of the other two, we compute it only on a small subset of our dataset, consisting of the images generated for ten random characters. UQI and CLIPscore, on the other hand, are computed on the entire dataset.
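To make the CLIP-based comparison concrete, the following sketch computes the cosine similarity between the ViT-L/14 image embeddings of a generated image and its ground-truth counterpart, using the openai/clip-vit-large-patch14 checkpoint from Hugging Face transformers. The file names are illustrative and this is not the project's exact evaluation code.

```python
# Sketch: cosine similarity between CLIP ViT-L/14 image embeddings of two images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_image_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)        # one embedding per image
    feats = feats / feats.norm(dim=-1, keepdim=True)       # L2-normalise before cosine similarity
    return float(feats[0] @ feats[1])

print(clip_image_similarity("harlequin_ground_truth.png", "harlequin_verbalised_triples.png"))
```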
Additionally, we employ statistical methods to evaluate whether the metrics above can measure the impact of the prompt on the quality of the generated images. For this purpose, we perform an ANOVA to measure the effect of the prompt on each metric, as well as Tukey’s HSD (honestly significant difference) tests to reflect the prompts’ effect on the generated images. These statistical methods are computed on two subsets of our dataset: characters that have DBpedia abstracts and characters that do not. Finally, we perform several Student’s t-tests to evaluate whether a given property (e.g. the gender or occupation of a character) leads to better results, and we run a separate test on the values of the instance of property (P31) alone. To carry out these tests, we extract the types and properties used more than 100 times. For each property, we build two subsets: the first contains the evaluation metric results of the characters that have the property, and the second is built by randomly choosing characters that do not have the evaluated property. (A sketch of these tests is given at the end of Section 5.2.1.)

5.1.2. Human evaluation

Although the above-mentioned evaluation metrics provide automatic measures to compare images, they remain unreliable for judging whether the generated images successfully portray the same characters as the ground-truth images. This is because noise such as the image style or its colour can affect the results of the automatic metrics. Therefore, we conduct a human evaluation study in which we ask participants to rate how likely it is that a pair of images (consisting of 1. the ground-truth image and 2. the generated image) portrays the same character. Additionally, we ask participants to list the criteria they think about when comparing two images. The latter was done to get an idea of the important features to look for when generating images of fictional characters. To evaluate the agreement of the participants, we compute Krippendorff’s Alpha [34] on three levels: globally, per evaluated image, and per prompt type.

5.2. Evaluation Results

5.2.1. Automatic Evaluation

The results we obtained from the automatic evaluation metrics show different outcomes. In terms of UQI, all four prompt types yield a similar average similarity score of ca. 0.5, leading us to conclude that this metric is not optimal for our purposes. The FID results (Table 5) show that images created from DBpedia abstracts are more similar to the ground truth. When it comes to CLIPscores, we see a hierarchy of prompt types in terms of the obtained average similarity scores: images generated from basic labels are the least similar (with a score of 0.48), followed by plain triples (0.55) and verbalised triples (0.56), while the DBpedia abstract prompts generate the images most similar to the ground truth, with a CLIPscore of 0.6. Results for UQI and CLIP are shown in Table 4. It is important to note that, contrary to UQI and CLIPscore, the FID metric is computed only on a subset of images generated for ten fictional characters, which makes it hard to draw any concrete conclusions about this method.
The ANOVA conducted on the UQI and the CLIPscores is shown in Table 6 and Table 7. The results show that UQI is not able to reveal a significant difference between the prompt types as a main fixed effect on the quality of the generated image. In contrast, CLIPscores reflect this effect with high confidence. The results of Tukey’s HSD test are shown in Tables 8 and 9. They highlight that the basic prompt is generally the worst prompt strategy and that the DBpedia abstract prompt is always the best one. However, for characters that do not have a DBpedia abstract, the verbalised triples prompt is better than the basic label and the plain triples prompts (with a p-value of 0.05810). Additionally, in order to understand whether the number of relations and of unique relation types attached to a given entity in the extracted triples has an effect on the generated image quality, we compute the correlation between these variables and the CLIPscores. The results indicate that there is no such correlation (see Table 10).
Finally, we present the results of the Student’s t-tests in two parts. The first part concerns the effect of the values of the instance of property attached to an entity on the quality of the generated images, shown in Table 11 for plain triples prompts and Table 13 for verbalised triples prompts. We can see that characters that already have a widely known visual representation (e.g. characters from comics, cartoons, or movies) generally have low CLIPscores. On the other hand, characters that do not have such a visual representation (e.g. characters from written works such as novels) are usually more similar to the ground-truth images. The second part of the Student’s t-tests deals with the effect of properties on the quality of the generated images. The results, shown in Table 12 and Table 14, indicate that the majority of relations impact CLIPscores negatively.
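The statistical analysis described in Section 5.1.1 can be reproduced along the following lines with SciPy. This is a schematic example over a hypothetical results layout (per-image CLIPscores grouped by prompt type), not the exact analysis scripts behind Tables 6-14.

```python
# Sketch: one-way ANOVA, Tukey's HSD, and a property-level t-test on per-image CLIPscores.
from scipy import stats

# Dummy data layout: one CLIPscore per generated image, grouped by prompt type.
scores = {
    "basic":      [0.41, 0.48, 0.52, 0.45, 0.39],
    "plain":      [0.50, 0.56, 0.58, 0.53, 0.49],
    "verbalised": [0.52, 0.57, 0.60, 0.55, 0.51],
    "dbpedia":    [0.58, 0.61, 0.64, 0.60, 0.57],
}

# Effect of the prompt strategy (main fixed effect) on the metric.
f_value, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F={f_value:.3f}, p={p_value:.5f}")

# Pairwise comparison of the prompt strategies (requires a recent SciPy providing tukey_hsd).
print(stats.tukey_hsd(*scores.values()))

# Effect of a single property: characters having it vs. a random sample of characters without it.
with_property = [0.54, 0.52, 0.57, 0.55]      # dummy CLIPscores of characters with e.g. "occupation"
without_property = [0.56, 0.58, 0.55, 0.59]   # dummy CLIPscores of a random sample without it
t_value, p_prop = stats.ttest_ind(with_property, without_property)
print(f"t-test: t={t_value:.3f}, p={p_prop:.5f}")
```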
5.2.2. Human Evaluation

To measure how humans evaluate the similarity of generated and ground-truth images, we ran an evaluation survey in which each participant is presented with images of ten different fictional characters chosen randomly from our dataset (shown in Figure 5). For each character, four pairs of images were displayed, each pair consisting of the ground-truth image and a generated image. Participants were asked to rate how likely it is that both images portray the same character on a scale of 1 to 5, 1 being very unlikely and 5 being very likely. Figure 3 shows the distribution of participant replies for all ten characters based on prompt type. We immediately notice that images generated from the three prompt types of basic labels, plain triples, and verbalised triples are more likely to be evaluated as not similar to the ground-truth image (i.e. the most frequent response for all three prompt types is 1). On the other hand, images generated with DBpedia abstract prompts are most frequently rated as 3 and 4, both having the same number of responses. When examining the high ratings that indicate a strong similarity between the ground truth and the generated images (i.e. 4 and 5), we notice a specific trend: the least frequent prompt type for those ratings is the basic label, followed by plain triples, verbalised triples, and DBpedia abstracts.

Figure 3: Distribution of the human evaluation survey results for all four prompt types.

Additionally, we analyse participant responses to the open question of which criteria they consider when giving their ratings. Figure 4 presents the top ten criteria mentioned by participants. The analysis was done by extracting nouns and adjectives and filtering out stop words and generic terms such as ‘character’. We also manually grouped synonymous concepts such as clothes, clothing, and outfit.
In total, our survey had 101 participants between the ages of 17 and 59, with an average age of 30. About 57% of participants were male, 41% female, and 2% non-binary. 48% of the participants had a master’s level of education. We did not target any specific group, since we wanted to receive general responses regarding the similarity of images; thus, we distributed the survey among friends and colleagues both from within and outside the research community. The cultural backgrounds of participants ranged from South and North America (ca. 8%) to Europe (ca. 63%), East Asia (ca. 10%), and the Middle East (ca. 20%). That being said, ca. 25% of participants were Italian. We received 37 responses to the open question of listing relevant criteria.

Figure 4: Results of the human evaluation survey related to the question about the key elements that influenced the users’ evaluation.

Finally, we measured the agreement of the participants using Krippendorff’s alpha. The global score is equal to 0.17, meaning that no concrete agreement was found. The same conclusion could be drawn at the level of the images presented to the participants and at the level of the prompt types used for the image generation (cf. Table 2 and Table 3).

5.2.3. Automatic and Human Evaluation Alignment

As a last step in our evaluation, we want to assess whether there is an alignment between the scores of the automatic metrics (UQI, CLIPscore, and FID) and the human evaluation. In order to normalise the participant evaluations, we standardise the scores given by each participant. For $i \in \{1, \dots, 101\}$, the unique ID of a given participant, and $j \in \{0, \dots, 39\}$, a given generated image, the standardised score is computed as follows:

$$x_{i,j}^{\mathrm{stand}} = \frac{x_{i,j} - \mu_i}{\sigma_i}$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the scores given by participant $i$. The alignment between the automatic metrics and the human evaluation scores is shown in Figure 6. We see that CLIPscores are the most correlated with the human scores, with a Pearson correlation of 0.5 for the plain triples prompt, 0.6 for the verbalised triples prompt, and 0.7 for the basic label prompt. Concerning the DBpedia abstract prompt, none of the metrics seem to be correlated with the human evaluation. UQI and FID are not correlated with the human evaluation results, both having correlations close to zero.
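For concreteness, the per-participant standardisation and the correlation with an automatic metric can be computed as in the following sketch (NumPy/SciPy); the data layout and variable names are hypothetical.

```python
# Sketch: z-standardise each participant's ratings, then correlate the mean standardised
# rating of every generated image with its CLIPscore.
import numpy as np
from scipy.stats import pearsonr

# ratings[i, j] = rating (1-5) that participant i gave to generated image j (dummy data here).
ratings = np.random.default_rng(0).integers(1, 6, size=(101, 40)).astype(float)

mu = ratings.mean(axis=1, keepdims=True)        # per-participant mean
sigma = ratings.std(axis=1, keepdims=True)      # per-participant standard deviation
standardised = (ratings - mu) / sigma           # x_ij^stand = (x_ij - mu_i) / sigma_i

human_score_per_image = standardised.mean(axis=0)

clip_scores = np.random.default_rng(1).uniform(0.2, 0.9, size=40)  # dummy CLIPscores per image
r, p = pearsonr(human_score_per_image, clip_scores)
print(f"Pearson correlation: {r:.2f} (p={p:.3f})")
```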
6. Discussion

The results of most of the automatic evaluation approaches we used (CLIPscores, ANOVA, and Tukey’s HSD), as well as the human evaluation results, suggest a clear trend: images generated using DBpedia abstracts as prompts were rated as most similar to the ground-truth images, followed respectively by verbalised triples prompts, plain triples prompts, and basic label prompts. This implies that DBpedia abstracts, which are written by human editors and contain more natural, diverse, and fluent text, enable text-to-image generators to produce better results. The fact that verbalised triples prompts produce the second-best results further emphasises the importance of fluent text for the quality of the generated image. These results directly answer RQ1.
When further analysing the obtained CLIPscores, we observe that the maximum CLIPscore occurred for an image generated using a basic label prompt, possibly indicating that the text-to-image model had “seen” this character during training. This enabled it to create an image similar to the ground-truth one without any additional context. However, the lowest CLIPscore also occurred with the basic label prompt type, further emphasising that for some characters, generating an image based only on their label is not enough.
We conclude that, in order to automatically generate images that correctly portray fictional characters, using natural text descriptions is the best option. When such text is available (e.g. in DBpedia abstracts), it is best to use it; however, as we observed when creating our dataset (see Table 1), many fictional character entities do not have a DBpedia abstract. To create images for those characters, the best method seems to be extracting knowledge about them in the form of triples, verbalising those triples using a large language model, and giving the verbalised text as input to a text-to-image generative model. In this case, the content of the triples is crucial for generating high-quality images. The quality of the images is not related to the number of triples or the number of unique relations contained in them, but it is highly dependent on the object values, in particular the value of the instance of property of a fictional character. Answering RQ2, the generated images can then be leveraged for completing missing images in Wikidata entities. Finally, addressing RQ3, when comparing the three automatic evaluation metrics, we see that only the CLIPscores align with the human evaluation scores. This is because, unlike the human and CLIP evaluations, which assess semantic similarity, the UQI and FID metrics focus only on image quality. This limitation in evaluating semantic content likely explains the discrepancy between the results of the three automatic evaluation metrics.

7. Limitations and Risks

Our work is limited in many aspects. First, we currently deal only with English data, due to the limitations of the verbalisation model and the Stable Diffusion model we used. Future work will consider multilingual datasets as well. Additionally, when designing prompts based on Wikidata triples, we had to make decisions such as extracting triples based on subjects without considering objects. We also treated all triples equally, with no emphasis on particular properties or types of entities.
As shown by the open question in our human evaluation survey, it is evident that some properties are more important than others when generating images to portray a specific character. Future work could explore in more depth which properties lead to better representations of characters. Further, when encountering triples that share the same predicate, we selected the one with the longest object, assuming it would contain the most information. We are aware that this decision might have removed important information for some characters; this can be addressed in future work by concatenating object strings or summarising them automatically.
Our use of the Stable Diffusion generative model means that our method inherits its biases as well. Although directly leveraging information about each character from its triples is supposed to limit biases during image generation, this cannot always be controlled (e.g., for some female entities, the model generated images of male characters). Additionally, using a predefined set of negative prompts for all characters (which includes terms such as mutilated and disfigured) considerably limits the model’s ability to correctly portray some characters. A possible solution could be to design specific negative prompts for each individual character in a semi-automatic manner, or to use another type of text-to-image model that does not require negative prompting. Our work is also limited in terms of the ground-truth dataset, which is constructed from Wikidata entities that already have images: for some of these entities, the images do not reliably portray the character itself but rather the actor depicting the character.
Finally, in order to mitigate any copyright and/or privacy risks, we stress that our method is not intended to be directly deployed on Wikidata, as we think that using AI-generated images can potentially be very harmful. Should this method be used for image completion, we encourage clearly watermarking images as AI-generated.

8. Conclusion

In this paper, we investigate four different methods for generating prompts based on knowledge extracted in the form of triples. We then generate images from each prompt using Stable Diffusion, a generative text-to-image model. We evaluate the different prompt types with automatic as well as human evaluation approaches and conclude that the best generated images are based on natural language text that includes the context and background of the character. When possible, this text can be extracted from a human-edited source such as DBpedia abstracts; however, most characters do not have a DBpedia entity. This brings to light the need to verbalise triples (i.e. transform them into natural text using large language models) and use them as prompts in order to obtain the best visual representation of the corresponding fictional characters. To the best of our knowledge, our work is novel in utilising triples for prompt engineering in order to complete missing information on Wikidata. Possible future work includes fine-tuning the latest Stable Diffusion model via a LoRA adaptation [35], trying other text-to-image models that rely on different architectures, and modifying prompts to include the most significant triples by investigating which properties affect image quality the most.
Our approach is not intended to directly complete entities on Wikidata with AI-generated images; rather, it can be used by editors to further enrich entities such as fictional characters, fictional places, or landscapes. Alternatively, instead of directly using the output of generative models, the images could be given to artists, who can use them as inspiration to create depictions of entities.

Acknowledgments

This project is the result of a research task force team at the International Semantic Web Summer School (ISWS) 2023. We would like to thank the organisers and tutors. We especially thank our mentor Albert Meroño-Peñuela for his valuable advice and input throughout this work. We made use of the central High Performance Computing system at Freie Universität Berlin to conduct the data collection and image generation, and we would like to express our gratitude for the resources provided. The work of the author Sefika Efeoglu is funded by the German Federal Ministry of Education and Research (BMBF) and the state of Berlin under the Excellence Strategy of the Federal Government and the Länder over the project. This paper has been developed within the HE project MuseIT, which has been co-funded by the European Union under Grant Agreement No 101061441. Views and opinions expressed are, however, those of the authors and do not necessarily reflect those of the European Union or the European Research Executive Agency. This work has been supported by the French government, through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002.

References

[1] D. Vrandečić, Wikidata: A New Platform for Collaborative Data Collection, in: Proceedings of the 21st International Conference on World Wide Web, 2012, pp. 1063–1064.
[2] D. Abián, A. Meroño-Peñuela, E. Simperl, An Analysis of Content Gaps Versus User Needs in the Wikidata Knowledge Graph, in: International Semantic Web Conference, Springer, 2022, pp. 354–374.
[3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-Resolution Image Synthesis With Latent Diffusion Models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[4] Introducing ChatGPT, 2023. URL: https://openai.com/blog/chatgpt.
[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763. URL: http://proceedings.mlr.press/v139/radford21a.html.
[6] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical Text-Conditional Image Generation with CLIP Latents, ArXiv abs/2204.06125 (2022).
[7] F. Schneider, Z. Jin, B. Schölkopf, Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion, arXiv preprint arXiv:2301.11757 (2023).
[8] C. Zhang, C. Zhang, M. Zhang, I. S. Kweon, Text-to-image Diffusion Models in Generative AI: A Survey, arXiv preprint arXiv:2303.07909 (2023).
[9] E. Mansimov, E. Parisotto, J. L. Ba, R. Salakhutdinov, Generating Images from Captions with Attention, arXiv preprint arXiv:1511.02793 (2015).
[10] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H.
Lee, Generative Adversarial Text to Image Synthesis, in: International conference on machine learning, PMLR, 2016, pp. 1060–1069. [11] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D. N. Metaxas, StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 5907–5915. [12] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1316–1324. [13] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-Shot Text-to-Image Generation, in: International Conference on Machine Learning, PMLR, 2021, pp. 8821–8831. [14] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al., Cogview: Mastering text-to-image generation via transformers, Advances in Neural Information Processing Systems 34 (2021) 19822–19835. [15] C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, N. Duan, Nüwa: Visual Synthesis Pre-training for Neural visUal World creAtion, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, Springer, 2022, pp. 720–736. [16] D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, CoRR abs/1312.6114 (2013). [17] A. Vahdat, J. Kautz, NVAE: A Deep Hierarchical Variational Autoencoder, 2021. arXiv:2007.03898 . [18] R. Child, Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images, ArXiv abs/2011.10650 (2020). [19] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, M. Chen, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, arXiv preprint arXiv:2112.10741 (2021). [20] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al., Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Advances in Neural Information Processing Systems 35 (2022) 36479–36494. [21] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-Resolution Image Synthesis with Latent Diffusion Models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695. [22] V. Liu, L. B. Chilton, Design Guidelines for Prompt Engineering Text-to-Image Generative Models, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–23. [23] J. Oppenlaender, R. Linder, J. Silvennoinen, Prompting AI Art: An Investigation into the Creative Skill of Prompt Engineering, arXiv preprint arXiv:2303.13534 (2023). [24] J. Oppenlaender, A Taxonomy of Prompt Modifiers for Text-to-Image Generation, arXiv preprint arXiv:2204.13988 (2022). [25] J. Wang, E. Shi, S. Yu, Z. Wu, C. Ma, H. Dai, Q. Yang, Y. Kang, J. Wu, H. Hu, et al., Prompt Engineering for Healthcare: Methodologies and Applications, arXiv preprint arXiv:2304.14670 (2023). [26] G. Amaral, O. Rodrigues, E. Simperl, WDV: A Broad Data Verbalisation Dataset Built from Wikidata, in: U. Sattler, A. Hogan, M. Keet, V. Presutti, J. P. A. Almeida, H. Takeda, P. Monnin, G. Pirrò, C. d’Amato (Eds.), The Semantic Web – ISWC 2022, Springer International Publishing, Cham, 2022, pp. 556–574. [27] L. F. R. Ribeiro, M. Schmitt, H. Schütze, I. 
Gurevych, Investigating Pretrained Language Models for Graph-to-Text Generation, in: Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, Association for Computational Linguistics, Online, 2021, pp. 211–227. URL: https://aclanthology.org/2021.nlp4convai-1.20. doi:10.18653/v1/2021.nlp4convai-1.20.
[28] M. Brümmer, M. Dojchinovski, S. Hellmann, DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 3339–3343. URL: https://aclanthology.org/L16-1532.
[29] Y. Chen, X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models, arXiv preprint arXiv:2305.10843 (2023).
[30] D. Varga, Full-Reference Image Quality Assessment Based on an Optimal Linear Combination of Quality Measures Selected by Simulated Annealing, Journal of Imaging 8 (2022). URL: https://www.mdpi.com/2313-433X/8/8/224. doi:10.3390/jimaging8080224.
[31] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, Y. Choi, CLIPScore: A Reference-free Evaluation Metric for Image Captioning, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7514–7528. URL: https://aclanthology.org/2021.emnlp-main.595. doi:10.18653/v1/2021.emnlp-main.595.
[32] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, OpenReview.net, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.
[33] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2018. arXiv:1706.08500.
[34] A. F. Hayes, K. Krippendorff, Answering the Call for a Standard Reliability Measure for Coding Data, Communication Methods and Measures 1 (2007) 77–89. URL: https://doi.org/10.1080/19312450709336664. doi:10.1080/19312450709336664.
[35] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, 2021. arXiv:2106.09685.

A.
Appendix

Table 2
Krippendorff's alpha focused on per-character agreement

Image                           Krippendorff's alpha
Lancelot (Q215681)              0.09354
Fëanor (Q716794)                0.09049
John Sheppard (Q923684)         0.00390
Hoshi Sato (Q1055776)           0.03129
Puck (Q1248616)                 0.04627
Harry Potter (Q3244512)         0.08192
Agramante (Q3606846)            0.08182
Mariner Moose (Q5353616)        0.08347
Octobriana (Q7077012)           0.04676
Phanuel (Q7180638)              0.13665

Table 3
Krippendorff's alpha focused on per-prompt-type agreement

Prompt                Krippendorff's alpha
Basic Label           0.17647
Plain Triples         0.12030
Verbalised Triples    0.20725
DBpedia Abstract      0.03761

Table 4
Comparison between generated and ground-truth images based on UQI and CLIP

Prompt Type           min UQI   mean UQI   max UQI   min ClipSim   mean ClipSim   max ClipSim
Basic Label           0.0       0.49970    0.80195   0.14747       0.48277        0.9590
Plain Triples         0.0       0.49966    0.87212   0.18396       0.54535        0.88957
Verbalised Triples    0.0       0.50075    0.86605   0.17431       0.55710        0.92599
DBpedia Abstract      0.0       0.49151    0.78226   0.16856       0.60192        0.92902

Table 5
FID metrics computed on the human evaluation subset (note that lower numbers mean higher similarity)

Fictional Character         Basic label   Plain prompt   Verbalised prompt   DBpedia abstract
Lancelot (Q215681)          126.35087     54.766038      96.49390            67.89936
Fëanor (Q716794)            119.39887     118.62600      125.80051           123.55941
John Sheppard (Q923684)     241.30466     236.92710      274.41479           139.81330
Hoshi Sato (Q1055776)       228.71975     203.96307      225.44356           161.90406
Puck (Q1248616)             118.86526     157.89964      137.24375           145.40420
Harry Potter (Q3244512)     190.61544     217.70724      197.86598           188.29662
Agramante (Q3606846)        125.21297     67.034159      132.19362           110.24826
Mariner Moose (Q5353616)    73.558398     174.05515      104.00699           76.30708
Octobriana (Q7077012)       78.224097     209.75924      173.35200           204.09481
Phanuel (Q7180638)          164.80482     75.619137      49.82890            57.28145
Average                     146.70551     151.63568      151.66440           127.48086

Table 6
Analysis of variance focused on entities with abstracts (N=914) regarding the distribution of UQI and the CLIPscore, with the prompt strategy as the main fixed effect.

Metric      df   sum of squares   mean of squares   F value     significance
CLIPscore   3    4.77477          1.59159           100.29899   2.18256e-62
UQI         3    0.02104          0.00701           0.53132     0.66078

Table 7
Analysis of variance focused on entities without abstracts (N=586) regarding the distribution of UQI and the CLIPscore, with the prompt strategy as the main fixed effect.
Metric df sum of squares mean of squares F value signifiance CLIPscore 2 2.29885 1.1494 77.94544 4.47908e-33 UQI 2 0.022125 0.00737 0.51729 0.59622 Table 8 Pairwise tests using Tukey HSD related to the effect of the prompt on the CLIPscore, on the subset of images having DBpedia abstracts prompt1 prompt2 diff lower upper q-value p-value basic prompt plain prompt 0.05848 0.04334 0.07363 14.03695 0.001 basic prompt verbalised prompt 0.06724 0.05209 0.08238 16.13806 0.001 basic prompt dbpedia abstract prompt 0.10023 0.08508 0.11537 24.05521 0.001 plain prompt verbalised prompt 0.00875 -0.00639 0.02390 2.10110 0.44767 plain prompt dbpedia abstract prompt 0.04174 0.02659 0.05688 10.01826 0.001 verbalised prompt dbpedia abstract prompt 0.03298 0.01784 0.04813 7.91715 0.001 Table 9 Pairwise tests using Tukey HSD related to the effect of the prompt on the CLIPscore, on the subset of images do not having DBpedia abstracts prompt1 prompt2 diff lower upper q-value p-value basic prompt plain prompt 0.06934 0.05220 0.08649 13.41698 0.001 basic prompt verbalised prompt 0.08605 0.06890 0.10320 16.6495 0.001 plain prompt verbalised prompt 0.01670 -0.00043 0.03385 3.23256 0.05810 Table 10 Correlation scores between number of relation and CLIP score for triple-based prompts var plain triple verbalised triples CLIP score Vs. Number of relations 0.10433 0.15312 CLIP score Vs. Number of unique relations 0.16794 0.17828 Figure 5: The 10 random generated images used for the Human Evaluation Table 11 Student tests on plain triples on the values of the “instance of” relations appearing more than 100 times instance of sample size mean with mean without t student p-value graphic novel character 120 0.53759 0.60367 -4.85939 0.0 fictional character in comics 120 0.54465 0.60367 -4.14029 5e-05 comic character 120 0.54554 0.60367 -4.12248 5e-05 comics characters 120 0.54714 0.60367 -3.80711 0.00018 comic strip character 120 0.54894 0.60367 -3.78012 0.0002 cartoon character 151 0.54186 0.59345 -3.66697 0.00029 comic characters 120 0.55257 0.60367 -3.53789 0.00048 fictional man 604 0.56867 0.54659 3.25443 0.00117 fictional character appearing in a film 146 0.5415 0.58396 -3.05048 0.0025 human being that only exists in fictional works 604 0.56749 0.54659 2.99562 0.00279 human fictional character 604 0.5668 0.54659 2.95531 0.00318 comic book character 120 0.5613 0.60367 -2.90949 0.00396 fictional person 604 0.56611 0.54659 2.86978 0.00418 character in a book 311 0.57115 0.54416 2.81644 0.00501 animation character 151 0.55279 0.59345 -2.77323 0.0059 fictional persons 604 0.56374 0.54659 2.50207 0.01248 fictional character who appears in animated films, television, and other animated works 151 0.55864 0.59345 -2.43599 0.01543 television show character 207 0.54562 0.57471 -2.42813 0.0156 fictional woman 604 0.5623 0.54659 2.27222 0.02325 fictional character who appears in a television series 207 0.54721 0.57471 -2.27788 0.02325 human fictional characters 604 0.56201 0.54659 2.26597 0.02363 comics character 120 0.57095 0.60367 -2.22542 0.02699 fictional human 604 0.56125 0.54659 2.16103 0.03089 animated character 151 0.56282 0.59345 -2.16122 0.03147 cartoon characters 151 0.5668 0.59345 -1.91238 0.05678 TV show character 207 0.55195 0.57471 -1.8769 0.06124 TV character 207 0.55358 0.57471 -1.77022 0.07743 cinematic character 146 0.55876 0.58396 -1.7426 0.08246 fictional character appearing in written works 311 0.56126 0.54416 1.72544 0.08495 character in literature 311 0.56107 0.54416 1.71958 0.08601 movie character 146 0.55973 
0.58396 -1.69479 0.09119 book character 311 0.56006 0.54416 1.63574 0.1024 literary character 311 0.55985 0.54416 1.62197 0.10532 novel character 311 0.55851 0.54416 1.43847 0.15081 literature character 311 0.55855 0.54416 1.42897 0.15352 film character 146 0.56439 0.58396 -1.38642 0.16668 TV series character 207 0.55938 0.57471 -1.25831 0.20899 television series character 207 0.56432 0.57471 -0.90818 0.36431 character in a novel 311 0.553 0.54416 0.90722 0.36464 human (as opposed to supernatural) character in the Old Testament/Hebrew Bible or New Testament 112 0.53849 0.55244 -0.85578 0.39304 human biblical figure 112 0.54104 0.55244 -0.7692 0.44259 television character 207 0.56607 0.57471 -0.75946 0.44801 biblical human 112 0.54532 0.55244 -0.47913 0.63232 biblical human character 112 0.54976 0.55244 -0.17741 0.85935 human in the Bible 112 0.55174 0.55244 -0.04366 0.96521 Table 12 Student tests on plain triples on the relations appearing more than 100 times relation sample size mean with mean without t student p-value said to be the same as 134 0.52624 0.59053 -4.45282 1e-05 described by source 200 0.54037 0.58667 -3.94038 0001 different from 235 0.53497 0.57629 -3.85235 00013 topic’s main category 135 0.54547 0.60322 -3.88675 00013 father 216 0.53967 0.57929 -3.56135 00041 from narrative universe 392 0.53984 0.56521 -37027 00221 place of birth 169 0.53054 0.57046 -3382 00257 sibling 157 0.53544 0.57294 -2.73217 00665 name in native language 155 0.54605 0.57979 -2.64531 00858 given name 493 0.55413 0.53382 2.58674 00983 enemy 176 0.55583 0.58792 -2.53043 01183 child 181 0.54056 0.57175 -2.52187 0121 part of 133 0.54286 0.57954 -2.43042 01575 mother 144 0.55002 0.58305 -2.40593 01677 media franchise 145 0.54594 0.57799 -2.2482 02532 first appearance 146 0.54038 0.56965 -2.15909 03166 present in work 389 0.5286 0.54332 -1.69822 08987 country of citizenship 428 0.55127 0.53896 1.49145 0.13621 languages spoken, written or signed 293 0.54282 0.55741 -1.45697 0.14566 family name 281 0.55035 0.53763 1.23692 0.21664 spouse 215 0.54113 0.55489 -1.1824 0.2377 member of 136 0.55572 0.57103 -1.10956 0.26817 narrative role 177 0.54159 0.55615 -19498 0.27427 residence 143 0.54038 0.55417 -0.95292 0.34144 occupation 565 0.5409 0.54756 -0.92396 0.3557 creator 588 0.54809 0.54479 0.46119 0.64475 voice actor 110 0.53768 0.54433 -0.4121 0.68067 eye color 113 0.54737 0.55021 -0.17031 0.86492 sex or gender 286 0.55365 0.55528 -0.16341 0.87025 hair color 114 0.55313 0.55122 0.11972 0.90481 performer 438 0.54877 0.54901 -02778 0.97785 Table 13 Student tests on verbalised triples on the values of the “instance of” relations appearing more than 100 times instance of sample size mean with mean without t student p-value fictional person 604 0.55721 0.52212 52155 0 human being that only exists in fictional works 604 0.55742 0.52212 58938 0 fictional man 604 0.56348 0.52212 5.93812 0 fictional persons 604 0.5636 0.52212 62202 0 fictional woman 604 0.56756 0.52212 6.67694 0 human fictional character 604 0.56339 0.52212 5.96221 0 human fictional characters 604 0.56621 0.52212 6.46184 0 fictional human 604 0.56144 0.52212 5.6847 0 human (as opposed to supernatural) character in the Old Testament/Hebrew Bible or New Testament 112 0.51176 0.59839 -5.70389 0 comics character 120 0.52673 0.58356 -4.15193 5e-05 human biblical figure 112 0.5395 0.59839 -3.79607 00019 fictional character who appears in animated films, television, and other animated works 151 0.52863 0.58071 -3.76422 0002 fictional character in comics 120 
0.5314 0.58356 -3.77533 0002 fictional character appearing in written works 311 0.56562 0.52885 3.72153 00022 biblical human character 112 0.54161 0.59839 -3.57517 00043 comic book character 120 0.5355 0.58356 -3.41365 00075 graphic novel character 120 0.53706 0.58356 -3.2513 00132 comic characters 120 0.54027 0.58356 -3.24203 00136 human in the Bible 112 0.55056 0.59839 -3.18845 00164 cartoon character 151 0.53912 0.58071 -38265 00224 comic character 120 0.54249 0.58356 -2.9583 00341 biblical human 112 0.55704 0.59839 -2.80093 00555 comic strip character 120 0.54284 0.58356 -2.76604 00612 character in a novel 311 0.553 0.52885 2.53108 01162 character in literature 311 0.5534 0.52885 2.5077 01241 cartoon characters 151 0.54863 0.58071 -2.43456 01549 novel character 311 0.55218 0.52885 2.37615 0178 comics characters 120 0.54978 0.58356 -2.32572 02087 animated character 151 0.54876 0.58071 -2.26202 02441 cinematic character 146 0.5397 0.57174 -2.25485 02489 literature character 311 0.55055 0.52885 2.19677 02841 movie character 146 0.53912 0.57174 -2.1486 03249 literary character 311 0.54767 0.52885 1.89849 0581 TV character 207 0.56049 0.53791 1.8604 06354 character in a book 311 0.54643 0.52885 1.76267 07845 fictional character appearing in a film 146 0.54601 0.57174 -1.72412 08575 book character 311 0.54493 0.52885 1.63332 0.10291 film character 146 0.54905 0.57174 -1.56495 0.11868 fictional character who appears in a television series 207 0.55602 0.53791 1.47277 0.14158 animation character 151 0.56582 0.58071 -1.1429 0.25399 television character 207 0.55208 0.53791 1.14182 0.25419 TV show character 207 0.52439 0.53791 -16176 0.28896 television show character 207 0.55049 0.53791 0.99373 0.32094 TV series character 207 0.54779 0.53791 0.78829 0.43098 television series character 207 0.54323 0.53791 0.44074 0.65964 Table 14 Student tests on verbalised triples on the relations appearing more than 100 times relation sample size mean with mean without t student p-value from narrative universe 392 0.54247 0.59269 -6.21662 0 enemy 176 0.55843 0.60108 -3.18641 00157 eye color 113 0.54021 0.59086 -3.12825 00199 father 216 0.55582 0.58778 -2.84772 00461 media franchise 145 0.56023 0.5985 -2.74637 00641 present in work 389 0.53543 0.55898 -2.72472 00658 topic’s main category 135 0.55616 0.59396 -2.55165 01128 member of 136 0.57542 0.60546 -2.36973 0185 languages spoken, written or 293 0.55711 0.57953 -2.29727 02196 signed name in native language 155 0.56111 0.58959 -2.16698 031 mother 144 0.56521 0.59498 -2.16136 0315 sibling 157 0.54875 0.57697 -2.15316 03207 first appearance 146 0.55121 0.58171 -2.12306 0346 part of 133 0.55823 0.58861 -2.11582 0353 occupation 565 0.54869 0.56324 -26606 03905 said to be the same as 134 0.55461 0.5818 -1.91761 05623 place of birth 169 0.55391 0.57885 -1.90071 0582 voice actor 110 0.54717 0.57493 -1.7728 07766 different from 235 0.54767 0.56429 -1.60514 0.10914 hair color 114 0.56243 0.58611 -1.56321 0.1194 given name 493 0.56572 0.55461 1.44034 0.15009 performer 438 0.55708 0.56849 -1.37983 0.16799 child 181 0.54882 0.56591 -1.33046 0.18421 family name 281 0.55684 0.56664 -11722 0.30949 narrative role 177 0.55689 0.5697 -0.97679 0.32934 sex or gender 286 0.55629 0.56362 -0.71496 0.47493 creator 588 0.5554 0.55862 -0.45502 0.64918 instance of* 27 0.58846 0.57998 0.25397 0.80052 residence 143 0.55984 0.5634 -0.25201 0.80121 spouse 215 0.56128 0.56394 -0.23058 0.81775 described by source 200 0.56531 0.56379 0.13335 0.89399 country of citizenship 428 0.55895 0.55945 
-06071 0.95161 Figure 6: Correlation plots of the automatic and the human evaluation Table 15 Prompt examples of an entity from the generated prompt dataset. Fictional Character Lancelot (Q215681) Basic Label Lance Hunter Plain Triples Lance Hunter instance of fictional character in comics. Lance Hunter instance of comics character. Lance Hunter instance of comic book character. Lance Hunter instance of comic character. Lance Hunter instance of comic strip character. Lance Hunter in- stance of comics characters. Lance Hunter instance of comic characters. Lance Hunter instance of graphic novel character. Lance Hunter instance of human be- ing that only exists in fictional works. Lance Hunter instance of fictional human. Lance Hunter instance of fictional person. Lance Hunter instance of fictional man. Lance Hunter instance of fictional persons. Lance Hunter instance of fictional woman. Lance Hunter instance of human fictional character. Lance Hunter instance of human fictional characters. Lance Hunter instance of TV show character. Lance Hunter instance of TV character. Lance Hunter instance of television series character. Lance Hunter instance of television show character. Lance Hunter instance of TV series character. Lance Hunter instance of fictional character who appears in a television se- ries. Lance Hunter instance of television character. Lance Hunter present in work Marvel’s Agents of S.H.I.E.L.D.. Lance Hunter from narrative universe shared fictional universe of many comic books pub- lished by Marvel Comics. Lance Hunter given name male given name. Lance Hunter sex or gender to be used in sex or gender (P21) to indicate that the hu- man subject is a male or semantic gender (P10339) to indicate that a word refers to a male person. Lance Hunter family name Hunter family name. Verbalised Triples Lance Hunter is a fictional character in Marvel’s Agents of S.H.I.E.L.D. He is a character in the TV series Marvel’s Agents of S.H.I.L.D. He is also a char- acter in the comic book genre. He is also a character in the graphic novel genre. DBpedia Abstract Lancelot Lance Hunter is a fictional character appear- ing in American comic books published by Marvel Comics. He first appeared in Captain Britain Weekly 19 (February 16, 1977) and was created by writer Gary Friedrich and artist Herb Trimpe. Hunter is a Royal Navy Commander who became Director of S.T.R.I.K.E. before later gaining the rank of Commodore and be- coming Joint Intelligence Committee Chair. The char- acter made his live-action debut in the Marvel Cine- matic Universe television series Agents of S.H.I.E.L.D., portrayed by Nick Blood.