CEUR Workshop Proceedings Vol-2699, paper 07. PDF: https://ceur-ws.org/Vol-2699/paper07.pdf (dblp: https://dblp.org/rec/conf/cikm/YangT20)
Creative Storytelling with Language Models and
Knowledge Graphs
Xinran Yanga , Ilaria Tiddia
a Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands



Abstract
Automated story generation is a popular and well-recognized task in the field of natural language processing. Pre-trained language models based on large Transformer architectures have shown great text generation capability. However, language models are limited when the generation requires explicit clues within the context. In this research, we study how to combine knowledge graphs with language models, and we build a creative story generation system named DICE. DICE uses external knowledge graphs to provide context clues and implicit knowledge in order to generate coherent and creative stories. The evaluation shows that our approach can effectively inject the knowledge from knowledge graphs into the stories automatically generated by the language model.

Keywords
knowledge graph, language model, story generation, natural language generation


1. Introduction

Story generation is a challenging task that requires reasonable and relevant content in the generated sentences, as well as dealing with logic and implicit information (Guan et al. 2019). Since large-scale pre-trained language models like OpenAI GPT-2 (Radford et al. 2019) and BERT (Devlin et al. 2018) were released in recent years, machines have shown the ability to generate a paragraph of understandable text according to a given topic. These language models are able to generate mostly-grammatical sentences with nearly perfect syntax and punctuation (Koncel-Kedziorski et al. 2019). However, the text generated by these language models often lacks commonsense knowledge
(Logan et al. 2019), and it is hard to control the content of the automatically generated text. One solution to this problem is to take advantage of structured inputs, such as tabular data and knowledge graphs (Koncel-Kedziorski et al. 2019). Meanwhile, one of the most popular methods to combine language models and knowledge graphs is to use knowledge graph embeddings. However, creating embeddings for knowledge graphs is a complex and time-consuming process; moreover, knowledge graphs are often updated, and new embeddings then have to be created (Wu et al. 2019). This research introduces a new method to combine knowledge graphs with language models without embedding approaches.

Figure 1: An example of the story generation. The orange words are the keywords provided by the user, and the blue words are the extended entities and relations from the DICE knowledge graph. These words are connected as knowledge graphs (SVO triples). "#i" indicates that the sentence is the i-th sentence of the story.

Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland
email: x6.yang@student.vu.nl (X. Yang); i.tiddi@vu.nl (I. Tiddi)
url: https://kmitd.github.io/ilaria/ (I. Tiddi)
orcid: 0000-0001-7116-9338 (I. Tiddi)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

We aim to answer the following research questions: Q1. How can the language model be combined with knowledge graphs for story generation without knowledge graph embeddings? Q2. What are the advantages and disadvantages of using knowledge graphs to automatically generate a story?

We propose a two-layer system called DICE, which contains a knowledge enrichment layer and a text generation layer, applying the knowledge graph and the language model respectively, to generate coherent and creative stories¹.
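The SVO triples act as the textual interface between the two layers: they are serialized into a prefix string that conditions the language model. A minimal sketch of this serialization, using the triples from the paper's running example (the exact rendering used by DICE is an assumption here):

```python
# Illustrative sketch: serialize (subject, verb, object) triples into the
# comma-separated prefix format that the fine-tuned language model is
# conditioned on. Not the actual DICE implementation.

def triples_to_prompt(triples):
    """Render SVO triples as a comma-separated prompt prefix."""
    return ", ".join(f"({s}, {v}, {o})" for s, v, o in triples)

triples = [("cat", "want", "nap"), ("Tina", "love", "sing"), ("Tina", "drink", "beer")]
print(triples_to_prompt(triples))
# (cat, want, nap), (Tina, love, sing), (Tina, drink, beer)
```

Because the interface is plain text, an updated knowledge graph changes only what triples are produced, not how the language model consumes them.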
Figure 1 presents an example of the story generation process. In the example, the system takes 4 keywords as input, then enriches the keywords with the knowledge graph and constructs subject-verb-object (SVO) triples; the latter are used as a prompt for the language model to generate stories.

The current work explores the possibility of connecting the knowledge graph and the language model with an interface. The advantage of using an interface is that the language model can rapidly adapt to the changes from an updated knowledge graph. For the knowledge enrichment layer, we implemented two versions of the DICE knowledge graph. For version 1, we retrieve knowledge from ConceptNet (Speer et al. 2017) and WordNet (Miller 1995) and construct an integrated knowledge graph of commonsense knowledge for story generation. For version 2, we enriched the knowledge graph of version 1 with DBpedia facts. For the text generation layer, we choose ROCStories² (Mostafazadeh et al. 2016) as our story corpus to fine-tune the language model, GPT-2. More details are discussed in Section 3.

The contributions are as follows:

• We propose a new way of combining knowledge graphs and language models for text generation without using knowledge graph embeddings. The results show that we can effectively inject the knowledge from knowledge graphs into the automatically generated stories as a background or a plot, and therefore control the content of these stories to some extent.

• We introduce a fine-tuned model which accepts SVO triples as a prompt, instead of the sentences used by original GPT-2 models, to generate reasonable and creative stories with the context provided by the SVO triples.

¹ Data and code available at github.com/ranyxr/dice_story
² https://cs.rochester.edu/nlp/rocstories/

2. Related Work

2.1. Text Generation using Language Models

Story generation is a knowledge-intensive process (Li et al. 2013). In particular, open story generation requires artificial intelligence systems to create narratives about any topic without a pre-defined domain model (Li et al. 2013). Meanwhile, a creative story should be both novel and appropriate (Sternberg 1999). Existing natural language generation systems are often limited when the tasks require higher levels of creativity and originality (Jain et al. 2017). Pre-trained language models based on large Transformer architectures (Vaswani et al. 2017), such as GPT-2 and BERT, can be a potential solution to this problem. Recently, the OpenAI team announced the upgraded GPT-3 (Brown et al. 2020) with 175 billion parameters, which is 100 times larger than the previous version, GPT-2. These language models show impressive text generation capabilities and can achieve state-of-the-art results without extra training (Keskar et al. 2019). However, these language models perform poorly when capturing the long tail of rare entities such as numbers and dates (Logan et al. 2019). Moreover, they are unable to build context clues and use implicit knowledge to generate a reasonable story ending (Guan et al. 2019).

2.2. Text Generation with Knowledge Graph Embeddings

The problems mentioned above can be alleviated by combining language models with knowledge graphs, where the former can make use of the knowledge extracted from the latter. For example, Logan et al. (2019) built the knowledge graph language model (KGLM), which can select and copy related facts from a knowledge graph. Ostendorff et al. (2019) enriched BERT with knowledge graph embeddings for document classification and obtained better results than the standard BERT approach. Meanwhile, Koncel-Kedziorski et al. (2019) introduced a new attention model for graph encoding and used it for graph-to-text generation. The main shortcoming of these models is their high cost in computational resources, which leads to long training and task execution times (Yao et al. 2019). Koncel-Kedziorski et al. (2019) also showed that their proposed model failed to mention 40% of the entities in the knowledge graphs in the generated text.

2.3. Knowledge Enrichment with Knowledge Graphs

Hsu et al. (2019) proposed the distill-enrich-generate framework, which uses knowledge graphs to enrich the words distilled from input images before generating stories. Liu et al. (2019) used external knowledge graphs to enrich the input sentence into a sentence tree for solving NLP tasks such as classification and sequence labeling. Guo et al. (2019) built a poetry knowledge graph for keyword mapping, extension, and selection to generate Chinese classical poems with high quality and relevance.
Similarly, Zhou et al. (2020) resort to a knowledge graph consisting of a collection of head-relation-tail triples to retrieve related topics in their intelligent dialogue system.

Different from the research above, instead of delivering a graph-to-text task which emphasizes the explicit translation from graph to text without creative writing, this study focuses on using information from knowledge graphs to provide a background or a plot for the language model as guidance or inspiration.

3. Method

3.1. Overview

The task here is to generate 5-sentence stories from a set of SVO triples that are extracted and regrouped in a knowledge graph. The expected input of the system is a set of keywords provided by users. Figure 2 shows the two-layer architecture of the DICE system. We use SVO triples as an interface to connect the knowledge enrichment layer and the text generation layer. The SVO triples can be constructed from knowledge graphs or extracted from the story corpus; meanwhile, they serve as a prompt for the language model to generate stories. The system first checks the relationships between the keywords and adds additional information using the knowledge graph, then generates a set of SVO triples to feed the language model, which generates the stories.

Two processes are involved to complete this task: language model fine-tuning and story generation. Language model fine-tuning is the pre-processing step for story generation, which includes two stages: SVO triple extraction and fine-tuning. Story generation also has two stages, i.e., knowledge enrichment and text generation. In the next section, we discuss each stage in detail.

Figure 2: Two-layer architecture of the DICE system. The green arrows indicate the workflow of the knowledge enrichment process. The purple arrows indicate the workflow of the text generation process. The blue arrows indicate the workflow of the language fine-tuning process.

3.2. Language Model Fine-tuning

The OpenAI team has released GPT-3, GPT-2's successor, but it was not available when we conducted this research. As a result, we choose GPT-2 as the natural language generator. OpenAI has released 4 versions of GPT-2³: the small version with 124M parameters, the medium version with 355M parameters, the large version with 774M parameters, and the XL version with 1.5B parameters. Considering the large amount of training data (the encoded story corpus is 19M), we choose the medium version of GPT-2 to strike a balance between speed, size, and creativity. An open-source Python package, gpt-2-simple⁴, is used to support the fine-tuning and text generation process. Meanwhile, we choose ROCStories as our story corpus, which contains nearly 10 thousand short stories; each story includes a title and five-sentence content.

³ https://openai.com/blog/gpt-2-1-5b-release/
⁴ https://github.com/minimaxir/gpt-2-simple
⁵ https://spacy.io/
⁶ https://spacy.io/universe/project/neuralcoref

3.2.1. SVO Triple Extraction

After acquiring the story corpus, we need to encode the dataset into a format that allows GPT-2 to generate text according to the specified SVO triples. We extract SVO triples from each story, then add the triples as a prefix to each story. This way, the language model can learn from a hint that each story is generated conditionally on the SVO triples.

We use spaCy⁵ to extract SVO triples from each story as "entities and relations". However, the process may encounter the coreference problem, i.e., a pronoun used as a subject. For example, if the sentence is "My sister has a dog. She loves him.", the triples directly extracted by spaCy are (My sister, has, dog) and (She, loves, him), which are not the expected result because we want a more specific reference as a subject, i.e., (My sister, has, dog) and (My sister, loves, dog). The resolution is to use neuralcoref⁶, which applies a neural-net scoring model to find coreferences in the text (Clark & Manning 2016). Meanwhile, to simplify the triples, we convert the verb into its lemma and only extract the main text of the subject and the object.
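This simplification step can be illustrated with a small sketch. Note that the real system relies on spaCy and neuralcoref; the coreference map and lemma table below are hand-supplied stand-ins for their output:

```python
# Toy sketch of triple simplification: resolve pronouns via a coreference
# map, lemmatize verbs with a small lemma table, and keep only the head
# noun of each entity. In DICE these steps are performed with spaCy and
# neuralcoref; this is not the actual implementation.

COREF = {"She": "My sister", "him": "dog"}   # stand-in for neuralcoref output
LEMMA = {"has": "have", "loves": "love"}     # stand-in for spaCy's lemmatizer

def simplify(triple):
    s, v, o = triple
    s = COREF.get(s, s)                      # replace pronouns with referents
    o = COREF.get(o, o)
    s = s.split()[-1].lower()                # keep only the main text (head noun)
    o = o.split()[-1].lower()
    return (s, LEMMA.get(v, v), o)           # verb -> lemma

raw = [("My sister", "has", "dog"), ("She", "loves", "him")]
print([simplify(t) for t in raw])
# [('sister', 'have', 'dog'), ('sister', 'love', 'dog')]
```

The output matches the simplified triples described above, which are then prepended to the story text as the training prefix.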
For the example above, we extract "sister" instead of "my sister", and "love" instead of "loves".

As a result, one example from the encoded dataset is the following:

    (Joseph, sign, deal), (Joseph, be, musician), (Joseph, be, songwriter), (Joseph, hope, write), (Joseph, lose, wallet), (woman, contact, Joseph), (Joseph, have, idea) The Best Single Joseph has just recently signed a deal with a new record label. He is a musician and a songwriter who hopes to write a best new hit. On his way to a local coffee shop to brainstorm, he lost his wallet. Joseph was frustrated until a woman contacted him and returned it. Suddenly, he realized he had an idea for his new song about kindness.

Words in red are the SVO triples; words in orange are the story title; words in blue are the story content.

3.2.2. Fine-tuning

The last step of this process is to fine-tune the model on the encoded dataset, which includes both the SVO triples and the original ROCStories. However, language models like GPT-2 are built for long-form content; generating short text like 5-sentence stories is not the typical generation scenario. To work around this issue, we use gpt-2-simple, which allows us to add flags indicating the start and the end of each short text (a 5-sentence story in this case); the language model will then automatically extract the short-form texts during the fine-tuning process.

The final fine-tuned model is called the DICE model, which can be found and downloaded on Google Drive⁷.

3.3. Story Generation

We use the SVO triples as a prompt for GPT-2. The triples are constructed from the keywords by using knowledge graphs. Each triple includes a subject and an object as its entities, and a verb as the relation that connects the entities. The SVO triples can not only give the language model topics (entities) to talk about but also define part of the plot (relations) of the story. For example, (Jane, be, singer) defines the background of the story, where there is a person named Jane who is a singer. The story generation includes two stages: knowledge enrichment and text generation.

⁷ https://drive.google.com/drive/folders/1T68rWkOde5ZwcuodQ9iWuYJAcqAmb0Jo
⁸ https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2020.07.01

3.3.1. Knowledge Enrichment

The system includes a new knowledge graph dataset named DICE KG. We implemented two versions of DICE KG. Version 1 (CW, i.e., ConceptNet and WordNet) combines two large open-source knowledge graphs: ConceptNet 5.6.0 and WordNet. ConceptNet is a knowledge graph that connects words and terms (phrases of natural language) with assertions (labeled, weighted edges) (Speer et al. 2017). Unlike ConceptNet, WordNet is a large lexical database of English with cognitive synonyms (synsets), which are connected by means of conceptual-semantic and lexical relations (Miller 1995). DICE KG converts these two datasets into an integrated model; as a result, the dataset contains more than 1.6 million nodes and over 3 million relationships of 54 types. DICE KG is large enough for finding relations between the keywords given by users and constructing a set of SVO triples using the entities and relations in the knowledge graph. Moreover, each relationship between the words has an annotation named "weight", which helps the system find a more reasonable path in the next step, i.e., the SVO triple construction.

We also introduce another version (DBCW, i.e., DBpedia, ConceptNet, and WordNet) of DICE KG that enriches version 1 with DBpedia's mappings⁸. The DBCW version includes over 8.5 million nodes with 6 labels and over 23 million relationships of 694 types. In this version, we enrich the common concepts from ConceptNet and WordNet with factual instances and properties from DBpedia. In Section 4, we compare the performance of the two versions of DICE KG.

To construct SVO triples from the given keywords, there are 3 steps: internal matching, external enrichment, and converting paths to triples. Internal matching concerns finding meaningful relations between the keywords, so that we can later put the keywords at the corresponding positions in an SVO triple. If a keyword has no relation with the other keywords, we use external enrichment to assign other related words from the knowledge graph to construct an SVO triple for the keyword. The first two steps are both semi-automatic, i.e., we use Cypher to query the graph database and get the matching candidates, while manually filtering the matching results, which is still needed to ensure the quality of the SVO triples.

Figure 3 shows an example of the SVO triple construction. We assume that the keywords are: {love, cat, beer, nap}. Firstly, we try to look up one-hop relationships (only specific relations are considered, such as CapableOf and Desires) between the keywords in the knowledge graph.
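The three-step construction can be sketched over a toy edge list. The edges, POS tags, relation-to-verb map, and the subject "Tina" below are illustrative stand-ins; the real system queries the Neo4j-hosted DICE KG with Cypher and filters candidates manually:

```python
# Toy sketch of SVO triple construction: internal matching finds one-hop
# relations among the keywords; external enrichment attaches a knowledge
# graph neighbour to each leftover keyword; KG relation names are mapped
# to common verbs. Illustrative only -- not the DICE KG data or queries.

EDGES = [  # (head, relation, tail), ConceptNet-style assertions
    ("cat", "Desires", "nap"),
    ("sing", "CausesDesire", "love"),
    ("beer", "RelatedTo", "drink"),
]
VERB_MAP = {"Desires": "want"}  # map one-hop KG relations to common verbs
POS = {"love": "VERB", "cat": "NOUN", "beer": "NOUN", "nap": "NOUN"}

def build_triples(keywords, subject="Tina"):
    triples, used = [], set()
    # Step 1: internal matching -- one-hop relations between two keywords.
    for h, r, t in EDGES:
        if h in keywords and t in keywords:
            triples.append((h, VERB_MAP.get(r, r.lower()), t))
            used.update((h, t))
    # Step 2: external enrichment -- attach a KG neighbour to each keyword
    # that matched nothing, with a person-class subject.
    for kw in keywords:
        if kw in used:
            continue
        for h, r, t in EDGES:
            if kw in (h, t):
                other = t if kw == h else h
                if POS.get(kw) == "VERB":
                    triples.append((subject, kw, other))   # keyword is the relation
                else:
                    triples.append((subject, other, kw))   # keyword is the object
                used.add(kw)
                break
    return triples

print(build_triples(["love", "cat", "beer", "nap"]))
# [('cat', 'want', 'nap'), ('Tina', 'love', 'sing'), ('Tina', 'drink', 'beer')]
```

With the paper's example keywords, the sketch reproduces the final triple set described in the text.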
In this case, we find one direct relation: (cat, desires, nap). Next, we assign additional information to the keywords without a direct relation. In this case, for the verb "love", we randomly choose the word "sing" as the verb's object, which is connected to "love" through a relation called "CausesDesire". Meanwhile, we choose "Tina", which belongs to the person class, as the verb's subject. For "beer", which is a noun, we assign it the verb "drink", which is related to "beer", and we also choose "Tina" as its subject to keep the story simple. Finally, we also need to map the direct one-hop relation into a more common word; for example, (cat, desires, nap) becomes (cat, want, nap). As a result, the final SVO triples are (cat, want, nap), (Tina, love, sing), and (Tina, drink, beer).

Figure 3: An example of SVO triple construction. Words in yellow are verbs. Words in green are nouns. Words in red are the enriched words from knowledge graphs.

3.3.2. Text Generation

The final stage is the text generation. After we get the SVO triples, we can use them as a prefix to generate stories from the trained model. In this process, we use GPT-2 as the story generator. Meanwhile, we use gpt-2-simple, which supports prefixes, to force the generated text to start with the prefix and generate stories from these triples. Finally, we truncate the prefix and flags in the generated stories, to return only the titles and contents. Table 1 shows one example generated by DICE using the triples mentioned above; it is handpicked from 75 automatically generated stories. We can see that the stories can exactly reflect the entities and relations from the SVO triples, although the triples may not be presented in the stories 100% of the time.

Table 1
A story generated by DICE. Words in red are the subjects from the SVO triples; words in orange are the verbs (relations) from the SVO triples; words in blue are the objects from the SVO triples.

Title: Lazy Cat
Content: Tina loved to sing and drink beer with her friends. One day she was drunk and didn't know what to do. She decided to go to the bar and see what she could do. She drank some beer and then went home. She went to sleep and woke up to her cat's snoring.

4. Experiments

4.1. Baselines

DICE (CW) vs. Human. A given keyword set is provided to both a human and the DICE system (with the CW version of the knowledge graph) to create stories; we then compare the human-written stories and the machine-written stories.

DICE (CW) vs. GPT-2. For the original GPT-2 model, we construct one or two sentences containing all the entities in the keyword set, and we use these sentences as input for the GPT-2 model, which is directly fine-tuned on ROCStories, to generate a story. We then use the same keyword set to generate stories with the DICE model and compare the results.

DICE (CW) vs. GPT-2-keyword-generation. GPT-2-keyword-generation⁹ is open-source software that uses GPT-2 to generate text pertaining to specified keywords. We compare the stories directly generated from a set of keywords with the stories generated by the DICE system.

DICE-CW vs. DICE-DBCW. We also compare the performance of the DICE system when using different versions of DICE KG, to evaluate whether factual knowledge graphs can contribute to story generation.

⁹ github.com/minimaxir/gpt-2-keyword-generation

4.2. Evaluation

4.2.1. Evaluation Metrics

The evaluation focuses on two aspects of the generated output: story-independent metrics and story-dependent metrics (Roemmele et al. 2017). Story-independent metrics, including grammatical correctness, clarity, and engagement, are used to analyze the quality of the generated output without considering its context.
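The truncation step described in Section 3.3.2 (stripping the triple prefix and the start/end flags to return only title and content) can be sketched as follows. The flag tokens shown are gpt-2-simple-style defaults and the sample string is fabricated for illustration; the exact format DICE uses is an assumption:

```python
# Sketch of post-processing: strip the SVO-triple prefix and the
# start/end flags from a generated sample, keeping only title + content.
# Flag tokens and sample are illustrative, not actual DICE output.

def clean_sample(sample, start_flag="<|startoftext|>", end_flag="<|endoftext|>"):
    # keep only the text between the flags
    text = sample.split(start_flag, 1)[-1].split(end_flag, 1)[0]
    # drop the triple prefix: everything up to the last ')' belongs to it
    if ")" in text:
        text = text.rsplit(")", 1)[1]
    return text.strip()

sample = ("<|startoftext|>(cat, want, nap), (Tina, love, sing) "
          "Lazy Cat Tina loved to sing. She took a nap with her cat."
          "<|endoftext|>")
print(clean_sample(sample))
# Lazy Cat Tina loved to sing. She took a nap with her cat.
```

Such a cleaning pass is what allows the same generated sample to be scored on the metrics below without the prompt scaffolding influencing the evaluation.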
Story-dependent metrics, including coherence, keyword coverage, and creativity, are used to evaluate the generated stories with reference to the context (Roemmele et al. 2017). The evaluation combines both automatic evaluation and manual evaluation. The explanation of each metric and the evaluation approaches are shown in Table 2.

Table 2
Explanations and approaches for each metric. Metrics in orange are story-independent metrics. Metrics in blue are story-dependent metrics.

Metric                   | Explanation                                                      | Evaluation approach
grammatical correctness  | The correctness of spelling, grammar and punctuation.            | Automatic
clarity                  | Whether the text is easy to understand.                          | Automatic
engagement               | Whether the writing style is interesting and effective.          | Automatic
creativity               | Whether the stories are creative or not.                         | Manual
coherence                | Semantic coherence of the output.                                | Manual
keyword coverage         | To what extent the keywords are presented in the generated text. | Automatic

Automatic Evaluation. For story-independent metrics, we used the automated analysis tool Grammarly to evaluate the overall grammaticality of the generated text. For keyword coverage, we used a script to monitor to what extent the keywords were presented in the generated stories.

Manual Evaluation. Stories should be reasonable and coherent with the context (Guan et al. 2019), which is hard to assess with automatic tools. As a result, a manual evaluation was also performed to more accurately evaluate the quality of each story. We invited 3 individuals to score the stories from each model, including stories from the original ROCStories. We applied 5-point Likert scales to rate each story on its creativity and coherence. For the comparison with human writing, a person (a non-native English speaker with professional working proficiency) was asked to write two stories with the same keywords. For human-written stories, each story should contain only 5 sentences, and every keyword in the keyword set must be mentioned in the story content. Finally, we invited people to estimate whether each story was written by a human or a machine and to score each story on its creativity and coherence.

5. Results and Discussion

5.1. Experiment Results

We picked 100 random samples from each model to evaluate their performance. We gathered the automatic and manual evaluation results and separated them by story-independent metrics and story-dependent metrics, as shown in Table 3 and Table 4 respectively. The results show that there is not much difference, according to the story-independent metrics, among the stories written by the language models and the human-written stories. The overall grammaticality of each model is satisfactory; the Grammarly overall score of the fine-tuned GPT-2 model is even higher than the score of the human-written stories. For samples from ROCStories, most of the grammatical errors are punctuation misuse, while for the stories generated by language models, the biggest writing issue is determiner (a/an/the/this, etc.) misuse, followed by punctuation misuse and wordy sentences.

For the two story-dependent metrics of creativity and coherence, all the models perform poorly compared with human writers. In general, the generated stories are not always logical, even with a properly trained model. The OpenAI team notes that it takes a few tries to get a good and reasonable result, and the number of tries is highly dependent on the topics presented in the training data. In this case in particular, the given keywords can influence the quality of the result significantly. For example, if the given keywords are barely related to each other, then the model can perform poorly. This is
                                                           because unrelated keywords make it more difficult to
and coherence. Then we calculated the overall average
                                                           generate related SVO triples, and unrelated SVO triples
score for each model.
                                            10             lead to unconnected sentences in the generated sto-
   Furthermore, we used a questionnaire to investi-
                                                           ries. However, the keyword coverage of the DICE sys-
gate whether readers could tell the difference between
                                                           tem (96% for DICE-CW and 97% for DICE-DBCW) is
the automatically generated stories and the human-
                                                           significantly higher than other baselines (73% for GPT-
written ones. We handpicked two stories generated
                                                           2, 88% for GPT-keyword-generation). However, for
by the DICE system where the stories were generated
                                                           the DICE-DBCW, the coverage of the enriched words
based on a given keyword set. Then we invited a per-
                                                           (80%) from DBpedia is lower compared with the key-
                                                           word coverage. This is because some of the enriched
   10 https://forms.gle/jEu1LohH5zkADiNt6
words are proper nouns, like brand names, which are Table 5
hardly shown in the training text.                  Result of the questionnaire.
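The keyword-coverage script itself is not included in the paper; a minimal sketch of such a check could look like the following. The function name and the plain word-matching strategy are our assumptions — the actual script may, for instance, lemmatize before matching.

```python
# Minimal sketch (our assumption, not the authors' script) of the
# keyword-coverage metric: the fraction of given keywords that appear
# in a generated story.
import re

def keyword_coverage(story, keywords):
    """Return the fraction of keywords found in the story text."""
    words = set(re.findall(r"[a-z']+", story.lower()))
    found = sum(1 for kw in keywords if kw.lower() in words)
    return found / len(keywords) if keywords else 0.0

story = "Anna bought a guitar. She practiced daily and joined a band."
print(keyword_coverage(story, ["guitar", "band", "piano"]))  # 2 of 3 keywords found
```

Averaging this value over all stories generated by a model would yield per-model scores like those in the keyword-coverage column of Table 4.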
Table 3
Results of the story-independent metrics.

  Model                    Correctness             Clarity        Engagement    Score
  GPT-2                    21 alerts / 4276 words  Very clear     Engaging      82/100
  GPT-keyword-generation   26 alerts / 3512 words  Mostly clear   Bland         78/100
  DICE-CW                  18 alerts / 3931 words  Mostly clear   Bland         75/100
  DICE-DBCW                31 alerts / 5279 words  Very clear     A bit bland   80/100
  Human                    54 alerts / 4591 words  Very clear     A bit bland   80/100

Table 4
Results of the story-dependent metrics.

  Model                    Creativity   Coherence   Keyword coverage
  GPT-2                    2.3/5        2.4/5       0.7275
  GPT-keyword-generation   2.4/5        2.7/5       0.88
  DICE-CW                  2.2/5        2.5/5       0.9625
  DICE-DBCW                2.3/5        2.7/5       0.9725
  Human                    3.7/5        4.9/5       N/A

5.1.1. Questionnaire Results

The questionnaire received 54 responses. Most of the respondents are native English speakers (4/5 of the respondents), and the rest (1/5) are non-native speakers with effective English proficiency. The results are shown in Table 5. In general, there is a considerable chance (37.5% on average) that people make a mistake when judging whether a story was written by a human or a machine. In particular, stories with short sentences and wrong word choices are more likely to be regarded as machine-written. On the other hand, for stories that are interesting and creative but lack coherence between sentences, people are more likely to mistakenly think the stories were written by a human.

Table 5
Result of the questionnaire.

  Story No.   Written by   Average score by human raters   Vote for Machine   Vote for Human
  Story1      Human        2.70                            70.4%              29.6%
  Story2      Machine      2.80                            70.4%              29.6%
  Story3      Human        3.37                            31.5%              68.5%
  Story4      Machine      2.74                            81.5%              18.5%

5.2. Injecting Relations into Stories

As mentioned in the last section, the keyword coverage (96%) and the relation coverage (100%) of the DICE system are very high during the test. This means the SVO triples can effectively shape the plots of the generated stories. During the experiments, we found that SVO triples can be used to inject entities, and the relations between them, into the stories as backgrounds or plots. As a result, the quality of the SVO triples and their order can significantly affect the quality of the automatically generated stories. Since these triples are generated from the knowledge graphs, the logic and relationships encoded in these knowledge graphs are also important for better story generation.

5.3. Quality of Generated Stories

As shown in Table 4, there is little difference in the creativity and coherence scores between the baselines and the DICE model. Although the DICE model allows us to inject relations into the stories, a relation can only affect the logic within each sentence; it cannot influence the logic that runs through the whole story. This is because the SVO triples used during language-model fine-tuning are extracted from each sentence separately, and the sentences are only loosely connected, so the triples cannot reflect relations such as causation throughout the text. As a result, the coherence of the stories generated by the DICE model is not satisfying in general.

5.4. Commonsense vs. Factual KG

We introduce two knowledge graphs in this research. The knowledge graph used in version 1 (CW) is a semantic knowledge graph in which common concepts and words have many connections with each other, which is the foundation for relating keywords and constructing SVO triples. Fact-based knowledge graphs like DBpedia, in contrast, can hardly provide connections between common concepts, and as a result they can hardly contribute to the triple construction process. However, with a combination of semantic and factual knowledge graphs, i.e., DICE KG version 2 (DBCW), we can make use of the knowledge about the instances of the concepts, and the properties of those instances, from the factual knowledge graph, and use it to enrich the entities in the triples.
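The paper does not publish the DBpedia queries used by DICE-DBCW; as an illustration only, entity enrichment along these lines could be sketched as a SPARQL query builder. The ontology class `dbo:Beverage`, the function name, and the query shape are all our assumptions.

```python
# Illustrative sketch (not the authors' code): build a SPARQL query
# that fetches English labels of instances of a DBpedia ontology
# class. The returned instance names could then be used to enrich
# the entities in SVO triples.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def build_enrichment_query(ontology_class, limit=5):
    """Return a SPARQL query for instance labels of a dbo: class."""
    return (
        "SELECT DISTINCT ?label WHERE { "
        f"?instance rdf:type dbo:{ontology_class} ; rdfs:label ?label . "
        'FILTER (lang(?label) = "en") '
        f"}} LIMIT {limit}"
    )

query = build_enrichment_query("Beverage")
# Sending `query` to DBPEDIA_ENDPOINT (e.g., with the SPARQLWrapper
# library) would return candidate enrichment words; that network call
# is omitted here.
print(query)
```

The `rdf:`, `rdfs:`, and `dbo:` prefixes are predeclared on the public DBpedia endpoint, so the query can be sent as-is.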
6. Conclusions

In this paper we showed how to use subject-verb-object triples as context-clue input to the generative model, connecting language models and knowledge graphs for story generation. Evaluation results showed that we can effectively inject entities and relations from knowledge graphs into the generated stories. Future work will focus on improving the coherence of the generated stories and giving them smooth transitions between sentences. For example, to improve the performance of the internal matching process, we can classify popular words into specific classes and use ontology techniques, such as SHACL (Knublauch & Kontokostas 2017) and OWL restrictions (McGuinness & Van Harmelen 2004), to make sure these classes interact with each other based on specific rules.

References

[1] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[2] Chen, J., Chen, J., & Yu, Z. (2019, July). Incorporating structured commonsense knowledge in story completion. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 6244-6251).
[3] Chen, Z., Eavani, H., Liu, Y., & Wang, W. Y. (2019). Few-shot NLG with pre-trained language model. arXiv preprint arXiv:1904.09521.
[4] Clark, K., & Manning, C. D. (2016). Deep reinforcement learning for mention-ranking coreference models. arXiv preprint arXiv:1609.08667.
[5] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[6] Guan, J., Wang, Y., & Huang, M. (2019, July). Story ending generation with incremental encoding and commonsense knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 6473-6480).
[7] Guo, Z., Yi, X., Sun, M., Li, W., Yang, C., Liang, J., ... & Li, R. (2019, July). Jiuge: A human-machine collaborative Chinese classical poetry generation system. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 25-30).
[8] Hsu, C. C., Chen, Z. Y., Hsu, C. Y., Li, C. C., Lin, T. Y., Huang, T. H. K., & Ku, L. W. (2019). Knowledge-enriched visual storytelling. arXiv preprint arXiv:1912.01496.
[9] Jain, P., Agrawal, P., Mishra, A., Sukhwani, M., Laha, A., & Sankaranarayanan, K. (2017). Story generation from sequence of independent short descriptions. arXiv preprint arXiv:1707.05501.
[10] Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
[11] Knublauch, H., & Kontokostas, D. (2017). Shapes Constraint Language (SHACL). W3C Candidate Recommendation, 11(8).
[12] Koncel-Kedziorski, R., Bekal, D., Luan, Y., Lapata, M., & Hajishirzi, H. (2019). Text generation from knowledge graphs with graph transformers. arXiv preprint arXiv:1904.02342.
[13] Li, B., Lee-Urban, S., Johnston, G., & Riedl, M. (2013, June). Story generation with crowdsourced plot graphs. In Twenty-Seventh AAAI Conference on Artificial Intelligence.
[14] Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., & Wang, P. (2019). K-BERT: Enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606.
[15] Logan, R., Liu, N. F., Peters, M. E., Gardner, M., & Singh, S. (2019, July). Barack's wife Hillary: Using knowledge graphs for fact-aware language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 5962-5971).
[16] McGuinness, D. L., & Van Harmelen, F. (2004). OWL Web Ontology Language overview. W3C Recommendation, 10(10).
[17] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.
[18] Mostafazadeh, N., Vanderwende, L., Yih, W. T., Kohli, P., & Allen, J. (2016, August). Story cloze evaluator: Vector space representation evaluation by predicting what happens next. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP (pp. 24-29).
[19] Ostendorff, M., Bourgonje, P., Berger, M.,
   Moreno-Schneider, J., Rehm, G., & Gipp, B. (2019).
   Enriching BERT with Knowledge Graph Embed-
   dings for Document Classification. arXiv preprint
   arXiv:1909.08402.
[20] Radford, A., Wu, J., Child, R., Luan, D., Amodei,
   D., & Sutskever, I. (2019). Language models are un-
   supervised multitask learners. OpenAI Blog, 1(8).
[21] Roemmele, M., Gordon, A. S., & Swanson,
   R. (2017, August). Evaluating story generation
   systems using automated linguistic analyses. In
   SIGKDD 2017 Workshop on Machine Learning for
   Creativity (pp. 13-17).
[22] Speer, R., Chin, J., & Havasi, C. (2017, February).
   Conceptnet 5.5: An open multilingual graph of gen-
   eral knowledge. In Thirty-First AAAI Conference
   on Artificial Intelligence.
[23] Sternberg, R. J. (Ed.). (1999). Handbook of creativ-
   ity. Cambridge University Press.
[24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
   Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017).
   Attention is all you need. In Advances in neural in-
   formation processing systems (pp. 5998-6008).
[25] Wu, T., Khan, A., Gao, H., & Li, C. (2019). Ef-
   ficiently embedding dynamic knowledge graphs.
   arXiv preprint arXiv:1910.06708.
[26] Yao, L., Mao, C., & Luo, Y. (2019). KG-BERT:
   BERT for knowledge graph completion. arXiv
   preprint arXiv:1909.03193.
[27] Zhou, L., Gao, J., Li, D., & Shum, H. Y. (2020).
   The design and implementation of xiaoice, an em-
   pathetic social chatbot. Computational Linguistics,
   46(1), 53-93.