From Nodes to Narratives: A Knowledge Graph-based
                                Storytelling Approach
                                Mike de Kok1,2,⇤ , Youssra Rebboud2,⇤ , Pasquale Lisena2 , Raphael Troncy2 and
                                Ilaria Tiddi1
                                1
                                    Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
                                2
                                    EURECOM, Sophia Antipolis, France


                                                                         Abstract
                                                                         Narratives wield a profound influence, shaping perceptions, beliefs, and decision-making processes.
                                                                         Although contemporary pre-trained language models have showcased impressive capabilities in text
                                                                         generation and question-answering tasks, they grapple with inherent limitations in knowledge coverage
                                                                         and exhibit vulnerability to societal biases. This work endeavors to forge a methodology that applies
                                                                         Knowledge Graphs in narrative construction. Rather than solely focusing on fundamental aspects such
                                                                         as the 4W (who, what, when, where) and general relationships, our approach comprises finely detailed
                                                                         semantic relations, delineating precise type of causality such as an event preventing, intending-to-cause,
                                                                         causing, or enabling another event. Applying state-of-art methods to predict such rich information, we
                                                                         demonstrate that it is possible to obtain automatically generated narratives of better grammatical and
                                                                         semantic accuracy.

                                                                         Keywords
                                                                         Narratives, Knowledge Graphs, Information Extraction, Event-centric Knowledge Graphs


                                1. Introduction
                                Narratives stand at the heart of our societal fabric, serving our understanding and facilitating
                                the exchange and preservation of knowledge. These narratives filter through our everyday lives,
                                appearing in diverse forms such as commercials, political campaigns, news broadcasts, and
                                more, each with its unique purpose and significance. Stories hold immense power to shape our
                                thoughts, beliefs, and actions, making them captivating and transformative [1]. Consequently,
                                the quest to innovate in the realm of complex narrative generation holds the potential to
                                usher in a new era of AI systems that are intricately attuned to human sensibilities. Building
                                upon the profound role of narratives in our society, it becomes evident that our means of
                                narrative generation and comprehension are intertwined with the capabilities of modern AI.
                                Pre-trained language models (PLM), exemplified by models such as BERT [2], GPT-3 [3], and

                                In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story’24 Workshop, Glasgow
                                (Scotland), 24-March-2024
                                ⇤
                                  Corresponding author.
                                � m.a.h.de.kok@student.vu.nl (M. d. Kok); youssra.rebboud@eurecom.fr (Y. Rebboud); pasquale.lisena@eurecom.fr
                                (P. Lisena); raphael.troncy@eurecom.fr (R. Troncy); i.tiddi@vu.nl (I. Tiddi)
                                � 0009-0009-9843-4707 (M. d. Kok); 0000-0003-3507-5646 (Y. Rebboud); 0000-0003-3094-5585 (P. Lisena);
                                0000-0003-0457-1436 (R. Troncy); 0000-0001-7116-9338 (I. Tiddi)
                                                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings


                                                                                                                                          29
the more recent ChatGPT (GPT-3.5)1 , have showcased remarkable progress in text generation,
and conversational tasks. Yet, these models, shaped by training on extensive datasets drawn
from undisclosed and diverse sources, bear intrinsic limitations, including knowledge gaps,
inaccuracies, and societal biases [3, 4]. Their challenges in maintaining semantic coherence
and capturing long-term dependencies within text generation further underscore the need for
innovation in narrative crafting [5, 6].
   Knowledge Graphs (KGs) are proven to be suitable structures for human knowledge, designed
for machine-readability and adaptability, while several experiments of text generation from
KGs are present in the literature [7]. Several KGs are available as data sources for the automatic
generation of narratives. For example, EventKG [8] is a knowledge graph that consolidates and
links events extracted from diverse sources, including Wikidata and YAGO [9]. This knowledge
graph comprises more than 1.3 million events, each associated with its respective spatial and
temporal coordinates. However, EventKG primarily focuses on representing events attributed
and relationships between sub-events and super-events. While the value of such a knowledge
graph is undeniable, its limitation to specific event properties, notably the sub(super)events or
the 4W, results in succinct and somewhat incomplete narratives.
   Instead, the FARO dataset [10, 11] encompasses a broader spectrum of semantically precise
relationships. This includes event-related connections such as Prevention, Enabling, Causality,
and Intention. In this work, we propose to enhance the WebNLG dataset [12], by incorporating
the FARO dataset. This augmentation aims to generate text with more detailed semantics,
particularly focusing on causal, preventive, intentional, and enabling relationships within a
specified subgraph of events. You can locate the implementation code, and the appendix at
https://github.com/ANR-kFLOW/KG2Narrative.
   The remainder of this paper is structured as follows: we first review the prior research pertain-
ing to narratives and the extraction of relevant information from KGs (Section 2). We present
datasets in Section 3, and we detail our approach for KG summarization, which encompasses
an initial information selection step before text generation in Section 4. We then present both
qualitative and quantitative results in Section 5. We conclude and outline some future work in
Section 6.


2. Related Work
A narrative graph [13] incorporates two main components: the individual representation of
events, including the “four W" aspects (who, what, when, where) and the interconnection of
these events through temporal and causal relationships. The Simple Event Model (SEM) [14]
provides a foundation for modeling events, but is still insufficient to link disparate events or
classes of the same type. To address this limitation, Blin [13] suggests enriching the event
relation types: temporal or causal links from Allen [15] and dbo:alongside links between
classes of the same type. Furthermore, the FARO ontology2 [10] covers most of the existing
event relations in the literature, from temporal relation to causal and more fine-grained ones
such as prevention.
1
    https://openai.com/blog/chatgpt/
2
    https://anr-kflow.github.io/faro/


                                                30
   KG summarization is an initial step of information retrieval and selection. To acquire the
essential nodes for event description, an effective approach involves ranking techniques that
assign significance to nodes based on the relationships they possess. Various methods can be
used such as entity ranking, relationship ranking, and semantic document ranking [16]. Blin
et al. propose a system that can identify relevant information needed to build a narrative graph,
by using an informed graph search traversal strategy [17]. To determine which information
is considered ‘relevant’ the method uses filters to prune the search space with respect to the
Simple Event Model (What, Who, Where, When).
   On the other hand, different methods for generating texts from knowledge graphs have been
proposed. In [18], triples are extracted to fine-tune a GPT-2 model [19], making the model
dependent on the input triples. A similar approach is introduced in [20], involving BART [21]
and T5 [22]. This approach obtained state-of-the-art performances on the AGENDA dataset [23]
but not on the WebNLG dataset. Both found that Pre-trained Language Models (PLM) work well
on unordered representations of the graph. JointGT [24] uses BART and T5, and exploits new
pre-training methods to explicitly preserve the input graph’s structural information. JointGT
outperforms the other mentioned technique on WebNLG, which might indicate that including
the topology of the graph lead to better results. A different approach [25] uses a transformer
encoding structure to encode both the global information and the local topology information,
and feeds a transformer to decode and generate text. However, this did not work as well as the
previously mentioned technique [20], which used a PLM model without encoding. This might
indicate that PLMs obtain better results than self trained transformer models.


3. Dataset
In this section, we present the datasets that we used to train our method: WebNLG [26] and the
FARO dataset [11] (Table 1). For evaluation, we use two evaluation datasets: the FARO test set
and the ASRAEL KG [27]. ASRAEL is a knowledge graph that includes various event-related
articles and their interconnections, including the 4W relations.

Table 1
Sample of the FARO dataset.
  Sentence                                   Trigger1   Trigger2   Tag       Triplets
  The government has implemented a series                                    <triplet>laws <subj>
                                             laws       abuse      prevent
  of laws to prevent the abuse of animals.                                   abuse <obj>prevent

  Before our evaluation, ASRAEL lacked precise semantic relations. Therefore, we had to
extract these relations from the event articles (linked to the KG) to conduct the assessment. We
enhanced the ASRAEL KG with these extracted additional relations (similarly to the ones in
FARO), resulting in a more dense and comprehensive knowledge graph. To achieve this objective,
we used a pre-trained REBEL model [28] to extract events and relations (cause, enable, prevent,
and enable). Furthermore, we leverage an existing event co-reference resolution model [29] to
perform the task within the KG. This model creates clusters of mentions, computes similarity
scores for each cluster, merges those with the highest score, and repeats this process until the
score fell below a defined threshold, which we empirically set to 0.95. This clustering process


                                                31
resulted in a graph primarily composed of clusters with a single mention, which are due to not
finding a similar match. According to our manual assessment, the algorithm matched correctly
a large number of syntactic matches, which makes it trustworthy. In total, we successfully
clustered 45,031 mentions, with 36,057 being unique. The resulting narrative graph3 provides
a RDF representation of event co-references and relationships, enriched with ontologies such
as NIF (NLP Interchange Format4 ), SEM and FARO to describe the relations between triples,
further enhancing the context and meaning of our knowledge graph. A global overview of a
narrative graph, and a concrete example can be found in the appendix.


4. Knowledge graph summarization
Knowledge Graph summarization comprises two tasks: the selection of pertinent information
from the knowledge graph, and the text generation based on the extracted data.

4.1. Relevant Information Selection
A SPARQL query has been written to extract the essential nodes, such as persons, places, and
times, crucial for narrative construction from a main event within the ASRAEL KG. This query
prioritizes the selection of events involving the 4W nodes with higher frequencies of incoming
edges. Mentions are selected similarly; the larger the cluster of co-referent mentions (formed
by the event co-reference model) is, the higher the priority of said cluster. Since we face a
limitation on the number of input tokens of the text generation model, up to three mentions are
selected from the same cluster.
   The quality of the output depends largely on the quality the output of previous steps (relation
extraction and co-reference resolution). Future work aims to enhance the accuracy of both these
tasks and explore methods for identifying indirectly linked relevant nodes to selected events.

4.2. Text Generation from Knowledge Graphs
As anticipated in Section 2, using a PLM instead of training a language model from scratch can
lead to better results. Furthermore, incorporating the graph’s topology into the model has been
shown to generate better natural text. JointGT [24] incorporates both of these characteristics,
hence, we adopted this method. The authors pre-trained this model on the KGText dataset [30],
consisting of 7 million graph-text pairs extracted from English Wikipedia dump.5 It includes
around 1.8 million entities and 1,210 relations.
   The WebNLG dataset does not contain any of the FARO relations. Therefore, we fine-tuned
the model on a merged dataset, combining the WebNLG and FARO, as in Table 2 without making
changes to the model itself. The creation of this combined dataset involves the following multi-
step process. Initially, entities and their respective encodings are extracted from the WebNLG
dataset. Subsequently, entities from the FARO dataset are encoded utilizing the extracted

3
  https://github.com/ANR-kFLOW/KG2Narrative/blob/main/Data/graphs/final_generated/eag_complete_merged.
  ttl
4
  https://persistence.uni-leipzig.org/nlp2rdf/
5
  https://dumps.wikimedia.org/


                                                 32
encodings from WebNLG. Finally, the resulting encodings and their relations are integrated
into the original WebNLG dataset, thereby producing the combined dataset.

Table 2
Sizes of the datasets used for training and evaluating the JointGT model.
                                       Dataset        Train        Val     Test
                                       WebNLG         12,876     1,619    1,600
                                       FARO            1,800       201      108
                                       Combined       14,676     1,820    1,708

  The model undergoes fine-tuning on the WebNLG dataset. We refer to the original model as
base model, and the model fine-tuned on the combined dataset as combined model.6


5. Results
5.1. Quantitative analysis

Table 3
The performance metrics of the best performing model on their corresponding validation and test set –
either WebNLG or the combined set. Both models are evaluated also on the FARO test set.
          Model             Dataset       BLEU METEOR ROUGE Step                   Epoch
                            Val           0.6642 0.4727        0.7558     22400 6
          Base (WebNLG) Test              0.6529 0.4681        0.7535     -        -
                            FARO test 0.0        0.0565        0.1299     -        -
                            Val           0.6368 0.4543        0.7468     36000 9
          Combined          Test          0.6101 0.4409        0.7260     -        -
                            FARO test 0.0477 0.0877            0.1949     -        -

   Table 3 provides crucial insights into the model’s performance, measured by the BLEU,
METEOR, and ROUGE metrics. BLEU emphasizes precision, indicating how accurately the
generated text aligns with the reference text. On the other hand, ROUGE focuses on recall,
gauging the extent to which the reference text is captured in the generated output. METEOR
combines elements of both precision and recall, and its effectiveness can be further enhanced by
incorporating improved word matching strategies. ROUGE suggests a high level of alignment
with reference texts in conveying information, while BLEU shows minor word deviations from
references. The lower METEOR score might stem from alignment nuances in score calculation.
Notably, the base model’s test performance closely mirrors the results outlined in the original
JointGT paper [24]. The model that was trained on the combined dataset performed slightly
worse for all three metrics than the model that was trained on the base WebNLG data. This can
be explained by two considerations. First, it is evident in Table 3 that tests on FARO have very
low performances. Secondly, the FARO dataset only accounts for a relatively small proportion
6
    The model was replicated using the same parameters from the original paper, except for the batch size lowered due
    to memory constraints. The parameters are Learning rate: 0.000025, Batch size: 4, Epochs: 10, Optimizer: Adam.
    Early stopping: 10 epochs


                                                          33
in the combined dataset (Table 2). To better understand the reasons, a qualitative analysis is
proposed in the next section.

5.2. Qualitative analysis
We examine instances from WebNLG and FARO datasets to analyze the base and combined
model’s performance. Observing Tables 4 and 5, the text generated by the combined dataset-
trained model appears more semantically robust. The base model’s generated text for FARO
triples (Table 4, column Base generated) is notably brief, often mirroring the triples with semantic
inaccuracies. Conversely, the combined model produces more coherent and accurate sentences
in the same dataset (column Combined generated), maintaining triple direction. However, it’s
important to note that while the generated content respects triple order and semantic accuracy,
it may still have limitations in altering the original label’s content.
   We also get a sight why the quantitative results are slightly worst for the combined model.
The WebNLG data (Table 5) contains multiple triples per instance, giving more information
about the text, and contains multiple labels. The FARO data (Table 4) contains only one triple per
instance, together with one target sentence (label). Therefore, the model has less information
about what to generate, and less chances to match the target label. Looking at the FARO input
triples and the target label, it can be seen that the relationship (predicate) is often not explicitly
represented by a particular word in the target sentence (implicit relation), making the evaluation
with matching words harder. We provide additional insights in the appendix.

User Evaluation on ASRAEL To evaluate the system’s performance, seven events from the
ASRAEL dataset have been selected based on several criteria: values for the 4W properties,
linking to a minimal number of articles, etc. The two largest (in terms of having the most
articles) events in ASRAEL having all of the 4W properties are selected for evaluation: “Operation
Breaking Dawn", and “2021 storming of the United States Capitol". The rationality behind this is
to ensure that the information selection method is challenged by having an extensive amount of
information to choose from. Among the remaining events in ASRAEL that include information
about the place and time, five additional events are selected, bringing the total to seven.
   The information selection method is used to select time, place, actor, and up to three mentions
from the seven selected events. The base and combined models are used to generate text from the
selected information. This information per event can be found in the appendix, together with the
generated text. A manual evaluation was needed due to the absence reference text for automated
metrics. Three annotators with a proficient level of English fluency determined which text
was better for each event, by using either “win", “lose", or “tie", assessing fluency (grammatical
correctness) and adequacy (correct integration of triples). This method aligns with the approach
in [24]. Majority voting determined the winner or equality between models, followed by a non-
parametric sign test at a significance level of ↵ = 0.05 to establish superiority. The non-parametric
statistical sign-test is used to compare data. It assesses whether the median difference between
observations differs significantly from zero, providing a p-value that indicates the probability of
observing the given difference or a more extreme difference if the null hypothesis (no difference)
were true. The significance level, denoted by alpha ↵, is a predetermined threshold set at 0.05,


                                                 34
                   Table 4: Sample of the FARO test-set and the generated output of the base and combined model.
Triple                      Label                                              Base generated                 Combined generated
                                                                                                              The company has also announced
                            (The directors said if Messrs. Drabinsky
                                                                                                              that it will offer a new credit
                            and Gottlieb mail an offer to shareholders by      The cause of the offer is to
(offer, cause, reimburse)                                                                                     facility to small businesses, in an
                            Nov. 22, it will reimburse them a maximum          reimburse .
                                                                                                              effort to reimburse them for the cost
                            of C$8.5 million for expenses related to a bid.)
                                                                                                              of capital expenditures.


                                                                                                                                                      35
                        Table 5: Sample of the WebNLG Test-set and the generated output of the base model.
Triple                              Label                                                Generated
(3Arena, owner, Live Nation         (The owner of 3Arena , Dublin , Leinster , Republic
Entertainment), (Dublin, is part of Ireland is Live Nation Entertainment .), (Dublin     3Arena is located in Dublin , Leinster ,
of, Republic of Ireland),           is part of Leinster and a city in the Republic of    Republic of Ireland and is owned by
(3Arena, location, Dublin),         Ireland . Dublin is also home to the 3Arena which    Live Nation Entertainment .
(Dublin, is part of, Leinster)      is currently owned byLive Nation Entertainment.)
against which the p-value is compared to determine statistical significance. Results of this
annotation are accessible in Table 6.
   The combined model produces better fluent text than the base model in 71.4% of the cases.
The non-parametric “sign test" was performed to measure a significant difference in the fluency
of the text. With a p-value of 0.11, no significant difference was found. The same was done to
gauge the text’s adequacy. With a p-value of 0.25, no significant difference was found.

Table 6
Fleiss’ Kappa () indicates perfect, and moderate agreement between annotators. The wins, losses, and
ties when comparing the combined model against the base model are indicated in percentages. No
model was significantly better than another with a significance level of 0.05.
                                      Fluency                             Adequacy
              Model                                                                       
                             Win %     Lose %      Tie %          Win %     Lose % Tie %
        Combined vs Base     71.4      14.3        14.3     1.0   28.6      0.0    71.4    0.6


Table 7
BLEU, METEOR, and ROUGE scores per model on the generated text from the article.
                             Model        BLEU       METEOR        ROUGE
                             Combined     0.1681     0.2081        0.3622
                             Base         0.1874     0.2273        0.3738


Table 8
Fleiss’ Kappa () indicates substantial agreement between annotators. The wins, losses, and ties when
comparing the combined model against the base model are indicated in percentages. The combined
model was significantly better than the base model in generating adequate sentences.
                                     Fluency                              Adequacy
             Model                                                                        
                            Win %     Lose %    Tie %             Win %     Lose % Tie %
       Combined vs Base     33.3      16.7      50.0       0.73   58.3      8.3    33.3    0.61


User Evaluation on an Manually Annotated Event To demonstrate whether the obtained
results are consistent independently from the quality of the information extraction output, we
decided to perform a user evaluation on a single article (sample), which has been manually
annotated by handcrafting the resulting subgraph. This subgraph has been processed with
both the combined and base model, and then evaluated using either “win", “lose", or “tie", in
the same way as described in the previous section. The percentage of wins, losses and ties for
the combined model, together with the Fleiss’ kappa are reported in Table 8. The combined
model has been assigned more wins for producing fluent and adequate text. The non-parametric
“signed test" is applied to test if this is significant, again, with a significance level of 0.05. With
a p-value of 0.34, no significant difference is found in generating more fluent texts between
models. With a p-value of 0.04, a significant difference is found in generating more adequate
sentences by the combined model, compared to the base model.


                                                   36
  BLEU, METEOR, and ROUGE metrics have been computed using the sentences from the
article as “reference label". These scores are detailed in Table 7. This illustrates that the base
model performs slightly better than the model that was trained on the combined data. A reason
for this could be formulated by looking at the generated texts, which can be found in the
appendix. More often than the combined model, the base model will output parts of the triple
without taking the relationship between them into account. This will result in a badly formed
sentence, but higher metrics, since more triples are incorporated. This is also reflected in the
scores in Table 8, where the combined model is commonly noted for producing more fluent
texts. Furthermore, the scores in Table 7 (computed on a single annotated article) are much
lower then those computed on the whole WebNLG test set (Table 3). This outcome could be
expected, considering that some of the triples extracted from the article are not, or to a limited
extend, present in the original WebNLG data used to pre-train the JointGT model.


6. Conclusion and Future Work
The primary goal of this research is to investigate how to build complex narratives in the form of
graphs of events, generating text with good level of complexity and semantic richness, expecting
the system to generate answers beyond only What (event), Who (actor), Where (location), and
When (time).
   We enhanced the WebNLG dataset through the incorporation of the FARO dataset, aimed at
refining the semantic depth of event relations. The expanded dataset now encompasses intricate
relations including causality, prevention, intention, and enabling. Even if the metrics show not
clear improvement, from qualitative analysis, we can state that training on precise event relations
produces more complete generated sentences, while no statistically significant difference was
observed on fluency. Future work will experiment on more data to draw final conclusions. Our
information selection from the graph focuses solely on the main event, disregarding pertinent
details from interconnected events. Additionally, the data used for fine-tuning differs from the
original dataset in terms of triple counts and instances, potentially impacting model evaluation.
Future research could explore selectively extracting sub-events and relations at the document
level to enhance clustering. Moreover, augmenting the dataset through NLP techniques could
significantly improve its quality and comprehensiveness.


Acknowledgements
This work has been partially supported by the French National Research Agency (ANR) within
the kFLOW project (Grant n° ANR-21-CE23-0028) and the European Union Horizon 2020
research and innovation programme within the MUHAI project (Grant n° 951846).


References
 [1] M. C. Green, T. C. Brock, G. F. Kaufman, Understanding media enjoyment: The role of
     transportation into narrative worlds, Communication theory 14 (2004) 311–327.


                                                37
 [2] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
     Transformers for Language Understanding, in: 2019 Conference of the North American
     Chapter of the Association for Computational Linguistics: Human Language Technologies,
     NAACL-HLT, ACL, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
 [3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
     R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
     S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
     Language Models are Few-Shot Learners, in: Advances in Neural Information Processing
     Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901.
 [4] J. Dodge, M. Sap, A. Marasovi , W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, M. Gard-
     ner, Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled
     Corpus, in: 2021 Conference on Empirical Methods in Natural Language Processing
     (EMNLP), ACL, 2021, pp. 1286–1305. doi:10.18653/v1/2021.emnlp-main.98.
 [5] J. Li, T. Tang, W. X. Zhao, J.-R. Wen, Pretrained Language Model for Text Generation:
     A Survey, in: Thirtieth International Joint Conference on Artificial Intelligence (IJCAI),
     Survey Track, IJCAI Organization, 2021, pp. 4492–4499. doi:10.24963/ijcai.2021/
     612.
 [6] N. Malkin, Z. Wang, N. Jojic, Coherence boosting: When your pretrained language model is
     not paying enough attention, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings
     of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
     1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp.
     8214–8236. doi:10.18653/v1/2022.acl-long.565.
 [7] S. Ji, S. Pan, E. Cambria, P. Marttinen, P. S. Yu, A Survey on Knowledge Graphs: Rep-
     resentation, Acquisition, and Applications, IEEE Transactions on Neural Networks and
     Learning Systems 33 (2022) 494–514. doi:10.1109/TNNLS.2021.3070843.
 [8] S. Gottschalk, E. Demidova, EventKG – the hub of event knowledge on the web – and
     biographical timeline generation, Semantic Web 10 (2019) 1039–1070. 6.
 [9] J. Hoffart, F. M. Suchanek, K. Berberich, G. Weikum, YAGO2: A spatially and temporally
     enhanced knowledge base from Wikipedia, Artificial Intelligence 194 (2013) 28–61.
[10] Y. Rebboud, P. Lisena, R. Troncy, Beyond Causality: Representing Event Relations in
     Knowledge Graphs, in: Knowledge Engineering and Knowledge Management, Springer,
     2022, pp. 121–135. doi:10.1007/978-3-031-17105-5_9.
[11] Y. Rebboud, P. Lisena, R. Troncy, Prompt-based Data Augmentation for Semantically-
     Precise Event Relation Classification, in: Semantic Methods for Events and Stories work-
     shop (SEMMES), Hersonissos, Greece, 2023.
[12] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, Creating Training Corpora for
     NLG Micro-Planners, in: Proceedings of the 55th Annual Meeting of the Association for
     Computational Linguistics (Volume 1: Long Papers), ACL, Vancouver, Canada, 2017, pp.
     179–188. doi:10.18653/v1/P17-1017.
[13] I. Blin, Building Narrative Structures from Knowledge Graphs, in: The Semantic Web:
     ESWC 2022 Satellite Events, Springer, Germany, 2022, pp. 234–251.
[14] W. R. van Hage, V. Malaisé, R. Segers, L. Hollink, G. Schreiber, Design and use of the
     Simple Event Model (SEM), Journal of Web Semantics 9 (2011) 128–136.


                                              38
[15] J. F. Allen, Maintaining Knowledge about Temporal Intervals, Commun. ACM 26 (1983)
     832–843. doi:10.1145/182.358434.
[16] V. Jindal, S. Bawa, S. Batra, A review of ranking approaches for semantic search on Web,
     Information Processing & Management 50 (2014) 416–425.
[17] I. Blin, I. Tiddi, R. van Trijp, A. ten Teije, Identifying graph traversal strategies to build
     narrative graphs, Under review (2023).
[18] X. Yang, I. Tiddi, Creative Storytelling with Language Models and Knowledge Graphs,
     in: S. Conrad, I. Tiddi (Eds.), CIKMW2020 Proceeding of the CIKM 2020 Workshops,
     CEUR Workshop Proceedings, CEUR-WS, 2020, pp. 1–9. 2020 International Conference on
     Information and Knowledge Management Workshops, CIKMW 2020 ; Conference date:
     19-10-2020 Through 23-10-2020.
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are
     Unsupervised Multitask Learners, OpenAI blog 1.8 (2019).
[20] L. F. R. Ribeiro, M. Schmitt, H. Schütze, I. Gurevych, Investigating Pretrained Language
     Models for Graph-to-Text Generation, in: 3rd Workshop on Natural Language Processing
     for Conversational AI, ACL, Online, 2021, pp. 211–227.
[21] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle-
     moyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Gener-
     ation, Translation, and Comprehension, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault
     (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational
     Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880.
     doi:10.18653/v1/2020.acl-main.703.
[22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
     Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, J.
     Mach. Learn. Res. 21 (2020).
[23] R. Koncel-Kedziorski, D. Bekal, Y. Luan, M. Lapata, H. Hajishirzi, Text Generation from
     Knowledge Graphs with Graph Transformers, in: J. Burstein, C. Doran, T. Solorio (Eds.),
     Proceedings of the 2019 Conference of the North American Chapter of the Association for
     Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
     Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp.
     2284–2293. doi:10.18653/v1/N19-1238.
[24] P. Ke, H. Ji, Y. Ran, X. Cui, L. Wang, L. Song, X. Zhu, M. Huang, JointGT: Graph-Text Joint
     Representation Learning for Text Generation from Knowledge Graphs, in: Findings of the
     Association for Computational Linguistics: ACL-IJCNLP 2021, ACL, 2021, pp. 2526–2538.
[25] L. Liu, M. He, G. Xu, M. Tan, Q. Wu, How to Train Your Agent to Read and Write, in:
     AAAI Conference on Artificial Intelligence, 2021, pp. 13397–13405.
[26] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, The WebNLG Challenge:
     Generating Text from RDF Data, in: 10th International Conference on Natural Language
     Generation, ACL, Santiago de Compostela, Spain, 2017, pp. 124–133.
[27] C. Rudnik, T. Ehrhart, O. Ferret, D. Teyssou, R. Troncy, X. Tannier, Searching News Articles
     Using an Event Knowledge Graph Leveraged by Wikidata, in: 2019 World Wide Web
     Conference, WWW, ACL, New York, NY, USA, 2019, p. 1232–1239.
[28] P.-L. Huguet Cabot, R. Navigli, REBEL: Relation Extraction By End-to-end Language
     generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021,


                                               39
     ACL, 2021, pp. 2370–2381. URL: https://aclanthology.org/2021.findings-emnlp.204.
[29] S. Barhom, V. Shwartz, A. Eirew, M. Bugert, N. Reimers, I. Dagan, Revisiting Joint Modeling
     of Cross-document Entity and Event Coreference Resolution, in: 57th Annual Meeting of
     the Association for Computational Linguistics, ACL, Florence, Italy, 2019, pp. 4179–4189.
[30] W. Chen, Y. Su, X. Yan, W. Y. Wang, KGPT: Knowledge-Grounded Pre-Training for Data-
     to-Text Generation, in: 2020 Conference on Empirical Methods in Natural Language
     Processing (EMNLP), ACL, Online, 2020, pp. 8635–8648.


                                              40