Creativity Embedding: a vector to characterise and classify plausible triples in deep learning NLP models

Isabeau Oliveri, Politecnico di Torino, isabeau.oliveri@polito.it
Luca Ardito, Politecnico di Torino, luca.ardito@polito.it
Giuseppe Rizzo, LINKS Foundation, giuseppe.rizzo@linksfoundation.com
Maurizio Morisio, Politecnico di Torino, maurizio.morisio@polito.it

Abstract

English. In this paper we define the creativity embedding of a text based on four self-assessment creativity metrics, namely diversity, novelty, serendipity and magnitude, on knowledge graphs, and on neural networks. We use the notion of triple (head, relation, tail) as the basic unit. We investigate whether additional information about creativity improves natural language processing tasks. In this work, we focus on the triple plausibility task, exploiting the BERT model and a sample of the WordNet11 dataset. Contrary to our hypothesis, we do not detect an increase in performance.

Keywords - Creativity Embedding; Creativity Metric; NLP; Creativity Evaluation; Triple; Knowledge Graph; BERT.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Current conversational agents have emerged as powerful instruments for assisting humans. Oftentimes, their cores are represented by natural language processing (NLP) models and algorithms. However, these models are far from being an exhaustive representation of reality and language dynamics: they are trained on biased data through deep learning algorithms, where the flow among the various layers could result in information loss (Wang et al., 2015). As a consequence, NLP techniques still find it challenging to manage conversations that they have never encountered before, and they react inefficiently to novel scenarios.

One way to mitigate these issues is the integration of structured information, for which knowledge graphs are one of the best-known representation systems. The most prominent example is the Semantic Web (Berners-Lee et al., 2001), where information is represented through linked statements, each one composed of a head, a relation and a tail, forming a triple (Figure 1). This semantic embedding offers significant advantages, such as reasoning over data and operating with heterogeneous data sources.

[Figure 1 diagram: head "Douglas Adams", relation "educated at", tail "St John's College".]
Figure 1: The triple (Douglas Adams, educated at, St John's College), from the Wikidata knowledge base (Vrandečić and Krötzsch, 2014), is an example of a statement.

Integration of structured information is not the only method that the literature provides to improve NLP techniques. Previous research pointed out that the analysis of creativity features could improve self-assessment evaluation, with benefits for the solutions generated and for the understanding of inputs (Lamb et al., 2018; Karampiperis et al., 2014; Surdeanu et al., 2008). We specify that in this work creativity is intended as the capability to create, understand and evaluate novel contents. The concepts of Creative AI have been discussed in their interconnections with the Semantic Web (Ławrynowicz, 2020), and are generalizable to knowledge graphs. Kuznetsova et al. (2013) define quantitative measures of creativity in lexical compositions, exploring different theories, such as divergent thinking, compositional structure and creative semantic subspace. The crucial point is that not every novel combination is perceived as creative and useful; they distinguish creativity perceived as unconventional, uncommon or "expressive in an interesting, imaginative, or inspirational way".

Although the scientific community has shown clear interest in exploring this direction, little research has been conducted on creativity in the NLP field.
The results and the considerations made by Kuznetsova and Ławrynowicz led us to investigate the possible correlations between improvements in NLP tasks and creativity, with a particular focus on self-assessment. In this paper we introduce a novel approach for supporting deep learning algorithms with a mathematical representation of the creativity features of a text. We named it creativity embedding and based it on self-evaluation creativity metrics computed over a graph knowledge base.

2 Approach

2.1 Self-assessment creativity metrics

When humans face a problem they have never encountered before, they usually perform a self-assessment procedure with respect to their previous knowledge and context, generally voting for the best solution. Following the example reported in Figure 2, we can imagine that a person has to describe the colour of a grey desk. He does not remember the name of the colour at that moment, and performs a creative process. He uses a metaphor to describe the grey colour of the desk, referring to the stereotypical colour of a "mouse". This metaphor is widely accepted, and the colour would ideally be understood by the interlocutor. If the random term "mask" is used in place of "mouse", the meaning will probably not be received unless a particular context or knowledge is shared between the person and the interlocutor, resulting in an ineffective creative process. To emulate this self-assessment procedure, we propose metrics inspired by the literature on related concepts, such as recommender systems (Monti et al., 2019) and machine learning (Pimentel et al., 2014; Ruan et al., 2020). The knowledge is represented by a graph of items interconnected by their relations (triples).

[Figure 2 diagram: a person, the possible solutions with their parameters p1, p2, p3, p4, ..., and the knowledge and context.]
Figure 2: A person produces different solutions to answer a question. He then performs a self-assessment procedure, taking into account several parameters p based on his knowledge and the context. Finally, he chooses the best possible solution. Parameters are expressed as numbers, for simplicity.

We define four metrics, namely diversity (1), novelty (2), serendipity (3), and magnitude (4). In these metrics we make use of a similarity function. In fact, to define the similarity (or the diversity, from another angle) between two or more items, we need a method and a representation that allow us to define a distance between them. In the literature, there is no fixed notion of similarity. However, a common strategy for texts is to transform words and sentences into vectors, taking into account and preserving their distributional properties and connections. Subsequently, mathematical distance functions are applied. Under these conditions, the similarity function defines a semantic similarity between two items (words or sentences). For prompt understanding, we anticipate that in our experiment we use the cosine similarity function and BERT vectors (embeddings) as word representations, as will be discussed in the following sections. Nevertheless, the metrics thus defined could be computed with a different item vector representation and similarity function, as long as the similarity function adopted has output domain [0,1], with a high value for high similarity.

Diversity (1) represents the semantic diversity between the head $h_T$ and the tail $t_T$ of the triple $T$. This information tells how far these two elements are from being semantically close. It could be considered as the internal semantic diversity of $T$.

\[ \mathrm{div}(T) = 1 - \mathrm{similarity}(h_T, t_T) \tag{1} \]

Novelty (2) of a triple $T$ is its average semantic diversity with respect to the other triples in the context.
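As a concrete illustration of the similarity function and of equation (1), the following minimal sketch uses plain Python lists in place of BERT embeddings. The three toy vectors and the function names are hypothetical. Cosine similarity ranges over [-1, 1] in general, so clipping negative values to 0 is an assumption made here only to satisfy the [0,1] output domain required above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity(u, v):
    """Similarity with output domain [0, 1]; the clipping is an assumption."""
    return min(1.0, max(0.0, cosine_similarity(u, v)))

def diversity(head_vec, tail_vec):
    """Equation (1): div(T) = 1 - similarity(h_T, t_T)."""
    return 1.0 - similarity(head_vec, tail_vec)

# Hypothetical 3-dimensional stand-ins for the embeddings of three words.
desk = [1.0, 0.0, 0.0]
mouse = [0.8, 0.6, 0.0]
mask = [0.0, 0.0, 1.0]

print(diversity(desk, mouse))  # low diversity: the vectors point in similar directions
print(diversity(desk, mask))   # maximal diversity: the vectors are orthogonal
```

With these toy values, the pair (desk, mouse) is less diverse than the pair (desk, mask), mirroring the metaphor example of Figure 2.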
Context $C$ is the sub-graph of triples obtained by traversing the paths of length $p$ in the knowledge graph, starting from the triple $h_T$ under examination, collecting the $n$ nearest triples. Novelty could be considered as the external semantic diversity of $T$ with respect to the retrieved context $C$.

\[ \mathrm{nov}(T) = 1 - \frac{1}{n} \sum_{i=1}^{n} \mathrm{similarity}(T, C_i) \tag{2} \]

Serendipity (3) is here intended as the semantic novelty of the triple $T$, taking into account the $s$ most novel triples in the knowledge graph (refined context $S$). It could be considered as the novelty relevance of $T$.

\[ \mathrm{ser}(T) = 1 - \frac{1}{s} \sum_{i=1}^{s} \mathrm{similarity}(T, S_i) \tag{3} \]

Magnitude (4) outlines the rarity of the triple, ranking ($rk$) each component of the triple by the number of its occurrences over the total number of items in the knowledge graph. The ranking function thus defined has output domain [0,1].

\[ \mathrm{mag}(T) = \frac{rk(h_T) + rk(rel_T) + rk(t_T)}{3} \tag{4} \]

2.2 Creativity Embedding

There were no annotated datasets on the creativity characteristics of interest. For this reason, a direct comparison with a ground truth was not possible. To overcome this obstacle, we indirectly measured the effectiveness of this approach by applying it to an external model and judging the results on the triple plausibility task (Yao et al., 2019; Wang et al., 2018; Wang et al., 2015; Padó et al., 2009). The triple plausibility task consists of classifying the triples of a dataset as plausible or not plausible, and comparing the result with the ground truth. We choose this task to perform an indirect evaluation of our proposal, relying on the correlation between plausibility and creativity (Lamb et al., 2018), as plausibility could represent a positive outcome of an effective creative process. The current trend in machine learning and natural language processing models pushes towards the mathematical representation of meaningful information by means of vectors, commonly known in this field as embeddings. For these reasons, we outline and train a neural network using the computed ground truth to predict creativity values, and define as creativity embedding the weights of the last hidden layer. This creativity embedding can be added and adapted in its dimension. Having stated the above concepts, we define the following research question.

Research Question: Could a creativity embedding extracted from the creativity neural network improve triple plausibility classification in deep learning models?

3 Model Architecture

3.1 BERT

We select Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) as the model for investigating the effects of the creativity embedding, due to its flexibility and modularity, as well as its being state of the art for various NLP tasks. The BERT model can be divided into three main parts: preprocessing of the input, a stack of transformer layers, and other layers on top to perform a particular task, typically a classifier. A stack of Transformers forms the BERT core. A transformer exploits the attention mechanism to learn the contextual relationships between the sentences and words in input. The input is not considered in one direction, but figuratively in all directions at once, defining the context of a word by considering all the surrounding words. The model is trained with a sort of game, where some words or entire sentences are masked, and the model has to predict them. We do not modify the core of the model; we are more interested in the preprocessing part, where we inject the creativity embedding, as explained in the next section.

3.2 Creativity Neural Network and Creativity CLS Embedding

The outline of the architecture proposed for the task is shown in Figure 3. In the lower part, the triple flows through the BERT model. We used a modified version of the tokenization technique of Knowledge Graph BERT (KG-BERT) (Yao et al., 2019), adapted to the structure of the triple. The triple is split into tokens according to the BERT vocabulary of known words. Special tokens are included in the sequence: classification (CLS) and separator (SEP) tokens. The embedding corresponding to CLS is in charge of representing the sentence mathematically, while SEP tokens separate different sentences. In the KG-BERT version for triple plausibility, SEP is used to separate head words from relation and tail words, as three different sentences. The corresponding token identifiers and embeddings are retrieved through two lookup tables, provided by the BERT model.

[Figure 3 diagram. Lower part: the input triple (head, rel, tail) passes through the KG-BERT tokenizer (word token strings to word ids), the BERT word embedding lookup table (30K entries of size 768; word ids to word embeddings), the Transformer attention mechanism (Vaswani et al., 2017; Devlin et al., 2019) and the triple plausibility classifier, which outputs L in [0, 1] ("Is the triple plausible? (No/Yes)"). Upper part: the Creativity Neural Network (fully connected + dropout), with input layer (768 * 3 = 2304), hidden layers (2048, ReLU), (2048, ReLU), (1024, ReLU), (768, ReLU) and output layer (4: div, nov, ser, mag); its creativity embedding is added to the CLS embedding.]
Figure 3: For each triple, the Creativity Embedding computed by the Creativity Neural Network is added to the BERT CLS embedding, defining the Creativity CLS Embedding. A linear classifier on top performs the triple plausibility classification.

At the top of Figure 3, we show our creativity neural network. A compact and fixed-size version of the embeddings is obtained from BERT by summing the embeddings of each component of the triple. This compact version feeds the proposed neural network, which is in charge of predicting the four creativity values and producing the creativity embedding. The neural network consists of an input layer (768 * 3 neurons), an output layer (4 neurons), and 4 fully connected hidden layers with dropout probability = 0.5. The activation function used is ReLU. This neural network structure is basic, since its main task is to have a flexible last hidden layer that is adaptable to the technology that would leverage the creativity embedding. The CLS token is one of the most representative tokens for performing classification and other types of predictions. This led us to exploit the CLS token by adding the creativity embedding of the triple to it, providing the model with a non-empty CLS, the Creativity CLS Embedding. In this case, the penultimate layer has been designed with a number of neurons equal to 768, the same size as the BERT embeddings. On top of the architecture, a linear classifier is in charge of the predictions for the plausibility task, relying on the Creativity CLS Embedding.

4 Experiment

In this experiment we randomly sample triples from the WordNet11 (Miller, 1995) dataset (50000 train, 5000 validation, 3000 test, with positive and negative labels balanced).

Creativity Neural Network. As stated in the previous sections, we compute the four metrics on each triple of the dataset to create the ground truth. As a similarity function we use cosine similarity, which returns a value between 0 and 1, with a high value for high similarity. We applied the cosine similarity function after transforming words and sentences into embeddings, provided by the BERT model. We encountered slowdowns only with the novelty metric.
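To make the cost of each metric more tangible, the following minimal sketch implements equations (2), (3) and (4) in plain Python. It rests on several assumptions that are not taken from the paper: every triple is already represented by a single vector, the context and the refined context are passed in as ready-made lists of vectors, negative cosine values are clipped to 0, and rk(x) is read as the number of occurrences of x divided by the total number of items. All names and toy values are hypothetical.

```python
import math
from collections import Counter

def similarity(u, v):
    """Cosine similarity clipped to [0, 1] (the clipping is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return min(1.0, max(0.0, dot / norms)) if norms else 0.0

def mean_dissimilarity(triple_vec, others):
    """1 - average similarity between one vector and a non-empty list of vectors."""
    return 1.0 - sum(similarity(triple_vec, o) for o in others) / len(others)

def novelty(triple_vec, context_vecs):
    """Equation (2): one similarity per context triple C_i, i = 1..n."""
    return mean_dissimilarity(triple_vec, context_vecs)

def serendipity(triple_vec, refined_vecs):
    """Equation (3): same form as (2), over the refined context S."""
    return mean_dissimilarity(triple_vec, refined_vecs)

def magnitude(triple, triples):
    """Equation (4): mean of rk over head, relation and tail.

    Assumption: rk(x) = occurrences of x / total number of items.
    """
    counts = Counter(item for t in triples for item in t)
    total = sum(counts.values())
    return sum(counts[item] / total for item in triple) / 3

# Hypothetical toy graph and vectors.
graph = [
    ("desk", "color", "grey"),
    ("desk", "color", "mouse"),
    ("mouse", "is_a", "animal"),
]
print(magnitude(("desk", "color", "mouse"), graph))
print(novelty([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch, novelty needs one similarity computation per context triple, in addition to the graph traversal that collects the context, whereas diversity needs a single one; this is consistent with novelty being the only metric for which slowdowns were observed.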
The number of nodes is not predictable a priori in our setting, and the mathematical nature of the formula is sensitive to a high number of nodes. Peaks of memory allocation could occur, as well as long computation times. We limit the failures due to out-of-memory errors or timeouts of the scheduled jobs by applying the "divide et impera" paradigm and other adjustments. The length of the path p, seen as the recursion depth, is fixed to 5. For each node involved in the recursion, the maximum number of neighbor nodes n considered is fixed to 20. Once we obtain all the metric values, we can train the Creativity Neural Network as a regression problem. We use: as loss criterion, the mean squared error loss; as optimizer, AdamW with learning rate = 0.001, betas = (0.9, 0.999), epsilon = 1e-08, weight decay = 0.01; as scheduler, StepLR with step size = 10 and gamma = 0.1. We train the model for 10 epochs, with a batch size of 512. To evaluate performance on the test set we compute explained variance score = -0.4493, mean absolute error = 0.1733, mean squared error = 0.0388 and $R^2$ score = -6.7694. Despite the small values of the mean squared and mean absolute errors, the negative $R^2$ tells us that the model approximates the distribution worse than a constant baseline that always predicts the mean of the targets. This is probably due to the low entropy of the input metric values, which, upon inspection, are concentrated around the value 0.5.

Triple Plausibility Task. The tokenized triple is given as input to the Creativity Neural Network, obtaining the creativity embedding. This is added to the CLS token embedding, and the triple flows through the Transformer stack. The BERT model is then used to make predictions and address the triple plausibility task, with a linear classifier on top of the Transformer stack. As loss function we use the binary cross-entropy loss. The literature suggests few epochs and samples for the finetuning process. We finetune BERT for 2 epochs; afterwards we freeze the weights of the model, training only the classifier layer for 3 epochs. We select BERT base uncased as baseline model; as optimizer, AdamW with learning rate = 5e-05; as scheduler, a linear scheduler with warm-up proportion = 10%; for the classifier, dropout probability = 0.5. We fix the maximum sequence length at 100 tokens, as no triple exceeds this number of tokens after tokenization.

Table 1: Triple plausibility experiment results.

            Number of triples          Metrics
Model       Train    Val    Test       Accuracy  Recall  Precision  F1
CE+BERT     50000    3000   5000       0.5093    0.8510  0.5102     0.6379
KG-BERT     225162   5218   21088      0.9334    0.9345  0.9324     0.9334

5 Result and Conclusion

In this paper we investigate whether the defined creativity embedding improves the triple plausibility task, exploiting the BERT model. We do not detect an increase in performance (Table 1) when comparing our results to those of KG-BERT. In this comparison we should point out that the sample used is one fifth of the complete WN11 dataset. This result is somewhat contrary to our expectations, as the creativity embeddings represent, in some way, a priori information. A possible explanation might be the learning methodology of the creativity embedding: we suppose that a significant loss of information has occurred in the process. Further research might explore other types of embeddings (Grohe, 2020), such as graph2vec, and different ways of integrating the proposed metrics. Future experimental investigations may try different parameter configurations. For example, the number of nodes considered could intuitively change the values of metrics such as novelty. Moreover, a more in-depth analysis of the dataset used, of the corresponding knowledge graph, and of the data correlations could provide additional insights. In future work, we will consider different combinations of the defined metrics to train the creativity neural network. It is possible that some metrics are more relevant to the task than others, or not relevant at all. Selecting only the strictly relevant metrics will lighten the computational effort and will give us information about the correlations between metrics and results. To conclude, we aim to bring the NLP community's attention to new research topics on creativity.

Acknowledgments

Computational resources were provided by HPC@POLITO (http://www.hpc.polito.it), which is a project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino. We thank the reviewers of the CLiC-it 2020 conference for their comments and advice.

References

Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. The semantic web. Scientific American, 284(5):34–43.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Martin Grohe. 2020. Word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS'20, pages 1–16, New York, NY, USA. Association for Computing Machinery.

P. Karampiperis, A. Koukourikos, and E. Koliopoulou. 2014. Towards machines for measuring creativity: The use of computational tools in storytelling activities. In 2014 IEEE 14th International Conference on Advanced Learning Technologies, pages 508–512.

Polina Kuznetsova, Jianfu Chen, and Yejin Choi. 2013. Understanding and quantifying creativity in lexical composition. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1246–1258, Seattle, Washington, USA, October. Association for Computational Linguistics.

Carolyn Lamb, Daniel G. Brown, and Charles L. A. Clarke. 2018. Evaluating computational creativity: An interdisciplinary tutorial. ACM Comput. Surv., 51(2), February.

Agnieszka Ławrynowicz. 2020. Creative AI: A new avenue for the semantic web? Semantic Web, pages 69–78.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Diego Monti, Enrico Palumbo, Giuseppe Rizzo, and Maurizio Morisio. 2019. Sequeval: An offline evaluation framework for sequence-based recommender systems. Information, 10(5):174.

Ulrike Padó, Matthew W Crocker, and Frank Keller. 2009. A probabilistic model of semantic plausibility in sentence processing. Cognitive Science, 33(5):794–838.

Marco A.F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Processing, 99:215–249.

Yu-Ping Ruan, Zhen-Hua Ling, Xiaodan Zhu, Quan Liu, and Jia-Chen Gu. 2020. Generating diverse conversation responses by creating and ranking multiple candidates. Computer Speech & Language, 62:101071.

Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2008. Learning to rank answers on large online QA collections. In Proceedings of ACL-08: HLT, pages 719–727.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85, September.

Quan Wang, Bin Wang, and Li Guo. 2015. Knowledge base completion using embeddings and rules. IJCAI'15, pages 1859–1865. AAAI Press.

Su Wang, Greg Durrett, and Katrin Erk. 2018. Modeling semantic plausibility by injecting world knowledge. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 303–308, New Orleans, Louisiana, June. Association for Computational Linguistics.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.