=Paper=
{{Paper
|id=Vol-2769/64
|storemode=property
|title=Creativity Embedding: A Vector to Characterise and Classify Plausible Triples in Deep Learning NLP Models
|pdfUrl=https://ceur-ws.org/Vol-2769/paper_64.pdf
|volume=Vol-2769
|authors=Isabeau Oliveri,Luca Ardito,Giuseppe Rizzo,Maurizio Morisio
|dblpUrl=https://dblp.org/rec/conf/clic-it/OliveriARM20
}}
==Creativity Embedding: A Vector to Characterise and Classify Plausible Triples in Deep Learning NLP Models==
Isabeau Oliveri (Politecnico di Torino, isabeau.oliveri@polito.it), Luca Ardito (Politecnico di Torino, luca.ardito@polito.it), Giuseppe Rizzo (LINKS Foundation, giuseppe.rizzo@linksfoundation.com), Maurizio Morisio (Politecnico di Torino, maurizio.morisio@polito.it)
Abstract

English. In this paper we define the creativity embedding of a text, based on four self-assessment creativity metrics, namely diversity, novelty, serendipity and magnitude, together with knowledge graphs and neural networks. We use as our basic unit the notion of a triple (head, relation, tail). We investigate whether additional information about creativity improves natural language processing tasks. In this work, we focus on the triple plausibility task, exploiting the BERT model and a sample of the WordNet11 dataset. Contrary to our hypothesis, we do not detect an increase in performance.

Keywords - Creativity Embedding; Creativity Metric; NLP; Creativity Evaluation; Triple; Knowledge Graph; BERT.

Figure 1: The triple (Douglas Adams, educated at, St John's College), from the Wikidata knowledge base (Vrandečić and Krötzsch, 2014), is an example of a statement.

1 Introduction

Current conversational agents have emerged as powerful instruments for assisting humans. Oftentimes, their cores are represented by natural language processing (NLP) models and algorithms. However, these models are far from being exhaustive representations of reality and language dynamics: they are trained on biased data through deep learning algorithms, where the flow of information among the various layers can result in information loss (Wang et al., 2015). As a consequence, NLP techniques still find it challenging to manage conversations they have never encountered before, reacting inefficiently to novel scenarios.

One way to mitigate these issues is the integration of structured information, of which knowledge graphs are one of the best-known representation systems. The most prominent example is the Semantic Web (Berners-Lee et al., 2001), where information is represented through linked statements, each one composed of a head, a relation and a tail, forming a triple (Figure 1). This semantic embedding allows significant advantages such as reasoning over data and operating with heterogeneous data sources.

The integration of structured information is not the only method the literature provides to improve NLP techniques. Previous research has pointed out that the analysis of creativity features could improve self-assessment evaluation, with benefits for the solutions generated and the understanding of inputs (Lamb et al., 2018; Karampiperis et al., 2014; Surdeanu et al., 2008). We specify that in this work creativity is intended as the capability to create, understand and evaluate novel content. The concepts of Creative AI have been discussed in their interconnections with the Semantic Web (Ławrynowicz, 2020), and generalize to knowledge graphs. Kuznetsova et al. (2013) define quantitative measures of creativity in lexical composition, exploring different theories, such as divergent thinking, compositional structure and the creative semantic subspace. The crucial point is that not every novel combination is perceived as creative and useful: perceived creativity distinguishes combinations that are unconventional, uncommon or "expressive in an interesting, imaginative, or inspirational way".

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Despite the clear interest of the scientific community in exploring this direction, little research has been conducted on creativity in the NLP field. The results and considerations of Kuznetsova and Ławrynowicz led us to investigate possible correlations between improvements in NLP tasks and creativity, with a particular focus on self-assessment. In this paper we introduce a novel approach for supporting deep learning algorithms with a mathematical representation of the creativity features of a text. We name it creativity embedding and base it on metrics of self-assessed creativity over a graph knowledge base.

Figure 2: A person produces different solutions to answer a question ("What is the color of the desk?"). He performs a self-assessment procedure, taking into account several parameters p based on his knowledge and the context. Finally, he chooses the best possible solution. Parameters are expressed as numbers for simplicity.

2 Approach

2.1 Self-assessment creativity metrics

When humans face a problem they have never encountered before, they usually perform a self-assessment procedure with respect to their previous knowledge and context, generally voting for the best solution. Following the example reported in Figure 2, we can imagine that a person has to describe the colour of a grey desk. He does not remember the name of the colour at that moment, and performs a creative process: he uses a metaphor to describe the grey colour of the desk, referring to the stereotypical colour of a "mouse". This metaphor is widely accepted, and the colour would ideally be understood by the interlocutor. If in place of "mouse" the random term "mask" were used, the meaning would probably not be received unless particular context or knowledge were shared between the person and the interlocutor, resulting in an ineffective creative process. To emulate this self-assessment procedure, we propose metrics inspired by the literature on related concepts, such as recommender systems (Monti et al., 2019) and machine learning (Pimentel et al., 2014; Ruan et al., 2020). Knowledge is represented by a graph of items interconnected by their relations (triples).

We define four metrics, namely diversity (1), novelty (2), serendipity (3), and magnitude (4). In these metrics we make use of a similarity function. In fact, to define the similarity (or, from another angle, the diversity) between two or more items, we need a method and a representation that allow us to define a distance between them. In the literature, there is no fixed notion of similarity. However, a common strategy for texts is to transform words and sentences into vectors that preserve their distributional properties and connections, and then to apply mathematical distance functions. A similarity function defined this way acts as a semantic similarity between two items (words or sentences). For ease of understanding, we anticipate that in our experiment we use the cosine similarity function and BERT vectors (embeddings) as word representations, as discussed in the following sections. Nevertheless, the metrics defined below can be computed with a different item vector representation and similarity function, as long as the similarity function has output domain [0, 1], with high values for high similarity.

Diversity (1) represents the semantic diversity between the head h_T and tail t_T of the triple T. It tells how far apart these two elements are semantically, and can be considered the internal semantic diversity of T.

div(T) = 1 − similarity(h_T, t_T)    (1)
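As a concrete point of reference, a similarity function with the required [0, 1] output domain can be sketched as follows. This is a minimal stdlib-only illustration; the linear rescaling of the cosine is one plausible choice, since the paper does not spell out how cosine similarity is mapped to [0, 1], and the real implementation operates on BERT embeddings.

```python
import math

def cosine_similarity(u, v):
    """Raw cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity(u, v):
    """Cosine similarity rescaled to the [0, 1] domain the metrics require.
    The linear rescaling is an assumption, not stated in the paper."""
    return (cosine_similarity(u, v) + 1.0) / 2.0

# Identical vectors are maximally similar; opposite vectors minimally similar.
print(similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(similarity([1.0, 0.0], [-1.0, 0.0]))  # 0.0
```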
Novelty (2) of a triple T is its average semantic diversity with respect to the other triples in its context. The context C is the sub-graph of triples obtained by traversing paths of length p in the knowledge graph, starting from the head h_T of the triple under examination and collecting the n nearest triples. Novelty can be considered the external semantic diversity of T with respect to the retrieved context C.

nov(T) = 1 − (1/n) Σ_{i=1..n} similarity(T, C_i)    (2)

Serendipity (3) is here intended as the semantic novelty of the triple T, taking into account the s most novel triples of the knowledge graph (the refined context S). It can be considered the novelty relevance of T.

ser(T) = 1 − (1/s) Σ_{i=1..s} similarity(T, S_i)    (3)

Magnitude (4) outlines the rarity of the triple, ranking (rk) each component of the triple by the number of its occurrences over the total number of items in the knowledge graph. The ranking function thus defined has output domain [0, 1].

mag(T) = (rk(h_T) + rk(rel_T) + rk(t_T)) / 3    (4)
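The four metrics translate almost directly into code. The sketch below takes the similarity function, the pre-retrieved context vectors and the ranking function rk as given; all of them are stand-ins for the actual BERT-based implementation, which is not reproduced here.

```python
def div(similarity, h_vec, t_vec):
    # Diversity (1): internal semantic diversity of the triple.
    return 1.0 - similarity(h_vec, t_vec)

def nov(similarity, t_vec, context_vecs):
    # Novelty (2): average dissimilarity from the n context triples C_i.
    return 1.0 - sum(similarity(t_vec, c) for c in context_vecs) / len(context_vecs)

def ser(similarity, t_vec, refined_vecs):
    # Serendipity (3): same form as novelty, over the s most novel triples S_i.
    return 1.0 - sum(similarity(t_vec, s) for s in refined_vecs) / len(refined_vecs)

def mag(rk, head, rel, tail):
    # Magnitude (4): mean occurrence-based rank of the three components.
    return (rk(head) + rk(rel) + rk(tail)) / 3.0
```

Note that with a near-constant similarity of 0.5 between items, both div and nov evaluate to roughly 0.5, which matches the clustering of metric values around 0.5 reported in Section 4.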
2.2 Creativity Embedding

There were no annotated datasets for the creativity characteristics of interest, so a direct comparison with a ground truth was hampered. To overcome this obstacle, we indirectly measure the effectiveness of our approach by applying it to an external model and judging the results on the triple plausibility task (Yao et al., 2019; Wang et al., 2018; Wang et al., 2015; Padó et al., 2009). The triple plausibility task consists of classifying a dataset's triples into plausible and not plausible classes, comparing the result with the ground truth. We chose this task to perform an indirect evaluation of our proposal, relying on the correlation between plausibility and creativity (Lamb et al., 2018), as plausibility can represent a positive outcome of an effective creative process. The current trend in machine learning and natural language processing models pushes the use of mathematical representations of meaningful information as vectors, commonly known in this field as embeddings. For these reasons, we outline and train a neural network, using the computed ground truth, to predict the creativity values, and we define as creativity embedding the weights of its last hidden layer. This creativity embedding can be added to other models and adapted in its dimension. Given the above concepts, we define the following research question.

Research Question: Can a creativity embedding extracted from the creativity neural network improve triple plausibility classification in deep learning models?

3 Model Architecture

3.1 BERT

We select Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) as the model for investigating the effects of the creativity embedding, due to its flexibility and modularity, as well as its being state of the art for various NLP tasks. The BERT model can be divided into three main parts: preprocessing of the input, a stack of transformer layers, and further layers on top that perform a particular task, typically a classifier. The stack of Transformers forms the BERT core. A transformer exploits the attention mechanism to learn the contextual relationships between the input sentences and words. The input is not read in one direction but, figuratively, in all directions at once, defining the context of a word from the entire set of surrounding words. The model is trained with a sort of game in which some words or entire sentences are masked and the model has to predict them. We do not modify the core of the model; we are more interested in the preprocessing part, where we inject the creativity embedding, as explained in the next section.

3.2 Creativity Neural Network and Creativity CLS Embedding

The outline of the architecture proposed for the task is shown in Figure 3. In the lower part, the triple flows through the BERT model. We used a modified version of the tokenization technique of Knowledge Graph BERT (KG-BERT) (Yao et al., 2019), adapted to the structure of the triple. The triple is split into tokens according to the BERT vocabulary of known words. Special tokens are included in the sequence: the classification (CLS) token, whose corresponding embedding is in charge of representing the sentence mathematically, and separator (SEP) tokens, which separate different sentences. In the KG-BERT version for triple plausibility, SEP is used to separate the head words, the relation words and the tail words into three different sentences.
[Figure 3: architecture diagram. The Creativity Neural Network is a fully connected network with dropout: an input layer of 2304 neurons (768 × 3), hidden layers of 2048, 2048, 1024 and 768 neurons with ReLU activations, and an output layer of 4 neurons (div, nov, ser, mag). Below it, the input triple (head, rel, tail) flows through the KG-BERT tokenizer, the BERT word-embedding lookup table and the Transformer attention mechanism (Vaswani et al., 2017; Devlin et al., 2019), ending in the triple plausibility classifier ([0, 1]: is the triple plausible? No/Yes).]

Figure 3: For each triple, the creativity embedding computed by the Creativity Neural Network is added to the BERT CLS embedding, defining the Creativity CLS Embedding. A linear classifier on top performs the triple plausibility classification.
The corresponding token identifiers and embeddings are retrieved through two lookup tables provided by the BERT model. At the top of Figure 3, we show our creativity neural network. A compact and fixed-size version of the embeddings is obtained from BERT by summing the token embeddings of each component of the triple. This compact version feeds the proposed neural network, which is in charge of predicting the four creativity values and producing the creativity embedding. The neural network consists of an input layer (768 × 3 neurons), an output layer (4 neurons), and 4 fully connected hidden layers with a dropout probability of 0.5. The activation function used is ReLU. This neural network structure is kept basic, since its main task is to provide a flexible last hidden layer adaptable to the technology that will leverage the creativity embedding. The CLS token is one of the most representative tokens for performing classification and other types of predictions. This led us to exploit the CLS token for adding the creativity embedding of the triple, providing the model with a non-empty CLS: the Creativity CLS Embedding. In this case, the penultimate layer has been designed with a number of neurons equal to 768, the same size as the BERT embeddings. On top of the architecture, a linear classifier is in charge of the predictions for the plausibility task, relying on the Creativity CLS Embedding.

4 Experiment

In this experiment we randomly sample triples from the WordNet11 (Miller, 1995) dataset (50000 train, 5000 validation, 3000 test, with positive and negative labels balanced).
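The embedding manipulations described in Section 3.2 can be sketched as follows, assuming the per-token embeddings are already available as plain lists of floats. The functions are dimension-agnostic; with 768-dimensional BERT embeddings, compact_input yields the 2304-value vector that feeds the creativity neural network, and creativity_cls produces the Creativity CLS Embedding.

```python
def sum_token_embeddings(token_embeddings):
    """Collapse one component's token embeddings into a single fixed-size vector."""
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) for i in range(dim)]

def compact_input(head_embs, rel_embs, tail_embs):
    """Concatenate the three summed component vectors (768 * 3 = 2304 values)."""
    return (sum_token_embeddings(head_embs)
            + sum_token_embeddings(rel_embs)
            + sum_token_embeddings(tail_embs))

def creativity_cls(cls_embedding, creativity_embedding):
    """Element-wise addition of the creativity embedding to the CLS embedding."""
    return [c + e for c, e in zip(cls_embedding, creativity_embedding)]
```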
Creativity Neural Network. As stated in the previous sections, we compute the four metrics on each dataset triple to create the ground truth. As similarity function we use cosine similarity, which returns a value between 0 and 1, with high values for high similarity. We apply the cosine similarity function after transforming words and sentences into embeddings provided by the BERT model. We encountered slowdowns only with the novelty metric: the number of nodes is not predictable a priori in our setting, and the mathematical nature of the formula is sensitive to a high number of nodes, so peaks of memory allocation can occur, as well as long computation times. We limit failures due to out-of-memory errors or timeouts of the scheduled jobs by applying the "divide et impera" paradigm and other adjustments. The length of the path p, seen as the recursion depth, is fixed to 5. For each node involved in the recursion, the maximum number of neighbour nodes n considered is fixed to 20. Once we obtain all the metric values, we can train the Creativity Neural Network as a regression problem. We use: as loss criterion, mean squared error; as optimizer, AdamW with learning rate = 0.001, betas = (0.9, 0.999), epsilon = 1e−08, weight decay = 0.01; as scheduler, StepLR with step size = 10 and gamma = 0.1; we train the model for 10 epochs with a batch size of 512. To evaluate performance on the test set we compute explained variance score = −0.4493, mean absolute error = 0.1733, mean squared error = 0.0388 and R2 score = −6.7694. Despite the small mean squared and absolute errors, R2 tells us that the model does not approximate the distribution better than the "best-fit" line. This is probably due to the low entropy of the input metric values, which, on inspection, cluster around the value 0.5.

Triple Plausibility Task. The tokenized triple is input to the Creativity Neural Network, obtaining the creativity embedding. This is added to the CLS embedding token, and the triple flows through the Transformer stack. The BERT model is then used to make predictions and address the triple plausibility task, putting a linear classifier on top of the Transformer stack. As loss function we use binary cross-entropy. The literature suggests few epochs and samples for the fine-tuning process. We fine-tune BERT for 2 epochs; afterwards we freeze the weights of the model, training only the classifier layer for 3 epochs. We select BERT base uncased as the baseline model; as optimizer, AdamW with learning rate = 5e−05; as scheduler, a linear scheduler with warm-up proportion = 10%; for the classifier, dropout probability = 0.5. We fix the maximum sequence length at 100 tokens, as no triple exceeds this number of tokens after tokenization.

5 Result and Conclusion

In this paper we investigated whether the creativity embedding we defined improves the triple plausibility task, exploiting the BERT model. We do not detect an increase in performance (Table 1) when comparing ourselves to the KG-BERT results. In this comparison we should point out that the sample used is one fifth of the complete WN11 dataset. This result is somewhat contrary to our expectations, as the creativity embeddings represent, in some sense, a priori information. A possible explanation might be the learning methodology of the creativity embedding: we suppose that a significant loss of information has occurred in the process. Further research might explore other types of embeddings (Grohe, 2020), such as graph2vec, and different integrations of the proposed metrics. Future experimental investigations may try different parameter configurations; for example, the number of nodes considered could intuitively change the values of metrics such as novelty. Furthermore, more in-depth analysis of the dataset used, the corresponding knowledge graph, and the data correlations could provide additional insights. In future work, we will consider different combinations of the defined metrics to train the creativity neural network. It is possible that some metrics are more relevant for the task than others; selecting strictly relevant metrics will lighten the computational effort and give us information about the correlations between metrics and results. To conclude, we aim to bring the NLP community's attention to new research topics on creativity.
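As a side note on the regression figures reported in the Experiment section: the R2 score compares the model's squared error against that of always predicting the mean of the targets, so any value below zero means the model underperforms that trivial baseline. A minimal computation of the standard definition:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Predicting the target mean everywhere gives R2 = 0; doing worse gives R2 < 0,
# which is the regime the creativity network falls into (R2 = -6.7694).
targets = [0.4, 0.5, 0.6]
print(r2_score(targets, [0.5, 0.5, 0.5]))  # 0.0
print(r2_score(targets, [0.9, 0.1, 0.9]))  # negative
```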
Model    | Train  | Val  | Test  | Accuracy | Recall | Precision | F1
CE+BERT  | 50000  | 3000 | 5000  | 0.5093   | 0.8510 | 0.5102    | 0.6379
KG-BERT  | 225162 | 5218 | 21088 | 0.9334   | 0.9345 | 0.9324    | 0.9334

Table 1: Triple plausibility experiment results.

Acknowledgments

Computational resources were provided by HPC@POLITO, a project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino (http://www.hpc.polito.it). We thank the reviewers of the CLiC-it 2020 conference for their comments and advice.

References

Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. The semantic web. Scientific American, 284(5):34–43.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Martin Grohe. 2020. word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS'20, pages 1–16, New York, NY, USA. Association for Computing Machinery.

P. Karampiperis, A. Koukourikos, and E. Koliopoulou. 2014. Towards machines for measuring creativity: The use of computational tools in storytelling activities. In 2014 IEEE 14th International Conference on Advanced Learning Technologies, pages 508–512.

Polina Kuznetsova, Jianfu Chen, and Yejin Choi. 2013. Understanding and quantifying creativity in lexical composition. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1246–1258, Seattle, Washington, USA, October. Association for Computational Linguistics.

Carolyn Lamb, Daniel G. Brown, and Charles L. A. Clarke. 2018. Evaluating computational creativity: An interdisciplinary tutorial. ACM Computing Surveys, 51(2), February.

Agnieszka Ławrynowicz. 2020. Creative AI: A new avenue for the semantic web? Semantic Web, pages 69–78.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Diego Monti, Enrico Palumbo, Giuseppe Rizzo, and Maurizio Morisio. 2019. Sequeval: An offline evaluation framework for sequence-based recommender systems. Information, 10(5):174.

Ulrike Padó, Matthew W. Crocker, and Frank Keller. 2009. A probabilistic model of semantic plausibility in sentence processing. Cognitive Science, 33(5):794–838.

Marco A. F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Processing, 99:215–249.

Yu-Ping Ruan, Zhen-Hua Ling, Xiaodan Zhu, Quan Liu, and Jia-Chen Gu. 2020. Generating diverse conversation responses by creating and ranking multiple candidates. Computer Speech & Language, 62:101071.

Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2008. Learning to rank answers on large online QA collections. In Proceedings of ACL-08: HLT, pages 719–727.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, September.

Quan Wang, Bin Wang, and Li Guo. 2015. Knowledge base completion using embeddings and rules. In IJCAI'15, pages 1859–1865. AAAI Press.

Su Wang, Greg Durrett, and Katrin Erk. 2018. Modeling semantic plausibility by injecting world knowledge. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 303–308, New Orleans, Louisiana, June. Association for Computational Linguistics.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.