Knowledge Infused Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learning

Ugur Kursuncu*, Manas Gaur*, Amit Sheth
AI Institute, University of South Carolina, Columbia, SC, USA
{kursuncu@mailbox.sc.edu, mgaur@email.sc.edu, amit@sc.edu}

* Equally contributed. Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Learning the underlying patterns in data goes beyond instance-based generalization to external knowledge represented in structured graphs or networks. Deep learning, which primarily constitutes the neural computing stream in AI, has shown significant advances in probabilistically learning latent patterns using a multi-layered network of computational nodes (i.e., neurons/hidden units). Structured knowledge, which underlies symbolic computing approaches and often supports reasoning, has also seen significant growth in recent years, in the form of broad-based (e.g., DBpedia, YAGO) and domain-, industry- or application-specific knowledge graphs. A common substrate with careful integration of the two will raise opportunities to develop neuro-symbolic learning approaches for AI, where conceptual and probabilistic representations are combined. As the incorporation of external knowledge will aid in supervising the learning of features for the model, deep infusion of representational knowledge from knowledge graphs within hidden layers will further enhance the learning process. Although much work remains, we believe that knowledge graphs will play an increasing role in developing hybrid neuro-symbolic intelligent systems (bottom-up deep learning with top-down symbolic computing) as well as in building explainable AI systems, for which knowledge graphs will provide scaffolding for punctuating neural computing. In this position paper, we describe our motivation for such a neuro-symbolic approach and a framework that combines knowledge graphs and neural networks.

Introduction

Data-driven bottom-up machine/deep learning (ML) and top-down knowledge-driven approaches to creating reliable models have shown remarkable success in specific areas, such as search, speech recognition, language translation, computer vision, and autonomous vehicles. On the other hand, they have had limited success in understanding and deciphering contextual information, such as the detection of abstract concepts in online/offline human interactions. Current challenges in the translation of research methods and resources into practice often draw from a class of rarely studied problems that do not yield to contemporary bottom-up ML methods. Policymakers and practitioners assert serious usability concerns that constrain adoption, notably in high-consequence domains (Topol 2019). In most cases, data-dependent ML algorithms require high computing power and large datasets, where the crucial signals may still be sparse or ambiguous, threatening precision (Cheng 2018). Moreover, ML models that are deployed in the absence of transparency and accountability (Rudin 2019) and trained on biased datasets can lead to grave consequences, such as potential social discrimination and unfair treatment (Olteanu et al. 2019). Further, the potentially severe implications of false alarms in an ML-integrated real-world application may affect millions of people (Kursuncu et al. 2019a; Kursuncu 2018).
The fundamental challenges are common to a majority of problems in a variety of domains with real-world impact. Specifically, these challenges are: (1) dependency on the large datasets required by bottom-up, data-dependent ML algorithms (Valiant 2000; De Palma, Kiani, and Lloyd 2019); (2) bias in the dataset, enabling the model to potentially cause social discrimination and unfair treatment; (3) multidimensionality, ambiguity and sparsity, as the data involves unconstrained concepts and relationships with meaning drawn from different contextual dimensions of the content, such as religion, history and politics (Kursuncu et al. 2019a; Kursuncu 2018) — further, the limited number of labeled instances available for training may fail to represent the true nature of concepts and relationships in data sets, leading to ambiguous or sparse true signals; (4) the lack of information traceability for model explainability; (5) the coverage of information specific to a domain that would be missed otherwise; (6) the complexity of the model architecture in time and space [1]; and (7) false alarms in model performance. Consequently, we believe standard, separate knowledge graph (KG) and ML methods are vulnerable to deducing or learning spurious concepts and relationships that appear deceptively good on a KG or training dataset, yet do not provide adequate results when the data set contains contextual and dynamically changing concepts and relations.

[1] https://www.theguardian.com/commentisfree/2019/nov/16/can-planet-afford-exorbitant-power-demands-of-machine-learning

In this position paper, we describe innovations that will operationalize more abstract models built upon the characteristics of a domain to render them computationally accessible within neural network architectures. We propose a neuro-symbolic method, knowledge-infused learning, that measures the information loss in latent features learned by neural networks through KGs with conceptual and structural relationship information, to address the aforementioned challenges. The infusion of knowledge during the representation learning phase raises the following central research questions: (i) How do we decide whether or not to infuse knowledge at a particular stage while learning between layers, and how do we quantify the knowledge to be infused? (ii) How do we merge latent representations between layers with external knowledge representations? (iii) How do we propagate the knowledge through the learned latent representation? Considering the future deployment of AI in applications, the potential impact of this approach is significant.

As stated in (Karpathy 2015), the deeper the network, the denser the representation and the better the learning. A large number of parameters and the layered nature of neural networks make them modifiable based on specific problem characteristics. However, challenges (1), (3), (5) and (7) make neural networks vulnerable to the sudden appearance of relevant-but-sparse or ambiguous features in often noisy big data (Valiant 2000; De Palma, Kiani, and Lloyd 2019; Kursuncu et al. 2019b). On the other hand, KG-based approaches structure search within a feature space defined by domain experts. To compensate for these vulnerabilities, incorporating knowledge into the learned representation in a principled fashion is required. A promising approach is to base this on a measurable discrepancy between the knowledge captured in the neural network and external resources.

Computational modeling coupled with knowledge infusion in a neural network will disambiguate important concepts defined in a KG, with their different semantic meanings, through the KG's structural relations. Knowledge infusion will redefine the emphasis of sparse-but-essential and irrelevant-but-frequently-occurring terms and concepts, boosting recall without reducing precision. Further, it will provide explanatory insight into the model, robustness to noise, and reduced dependency on frequency in the learning process. This neuro-symbolic learning approach will potentially transform existing methods for data analysis and for building computational models. While the impact of this approach is transferable (and replicable) to a majority of domains, the explicit implications are particularly apparent for the social science (Kursuncu et al. 2019a) and healthcare (Gaur et al. 2018) domains.
Related Work

As the incorporation of knowledge has been explored in various forms in prior research, in this section we describe the methodologies and applications specifically related to knowledge-infused learning: neural language models, neural attention models, and knowledge-based neural networks, all of which utilize external knowledge before/after the representation has been generated.

Neural Language Models (NLMs)

NLMs are a category of neural networks capable of learning sequential dependencies in a sentence, and they preserve such information while learning a representation. In particular, LSTM (Long Short-Term Memory) networks (Hochreiter and Schmidhuber 1997) emerged from the failure of RNNs (Recurrent Neural Networks) to remember long-term information. Concerning the loss of contextual information while learning, (Cho et al. 2014) proposed a context-feedforward LSTM architecture in which the context learned by the previous layer is merged with the forgetting and modulation gates of the next layer. However, if erroneous contextual information is learned in previous layers, it is difficult to correct (Masse, Grant, and Freedman 2018), a problem magnified by noisy data and content sparsity (e.g., Twitter, Reddit, blogs).

As the inclusion of structured knowledge (e.g., knowledge graphs) in deep learning improves information retrieval (Sheth and Kapanipathi 2016), prior research has shown the significance of knowledge in the pursuit of improving NLMs, such as in commonsense reasoning (Liu and Singh 2004). Transformer NLMs such as BERT (Devlin et al. 2018), including its variants BioBERT and SciBERT, are still data dependent. BERT has been utilized in hybrid frameworks such as (Scarlini, Pasini, and Navigli 2020) for the creation of sense embeddings using BabelNet and NASARI. (Liu et al. 2019a) proposed K-BERT, which enriches the representations by injecting triples from KGs into the sentence; a toy sketch of this injection idea follows below. As this incorporation of knowledge for BERT takes place in the form of attention, we consider K-BERT to be semi-deep infusion (Sheth et al. 2019). Similarly, ERNIE (Sun et al. 2019) incorporated external knowledge to capture lexical, syntactic, and semantic information, enriching BERT.
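To make the injection idea concrete, the following is a minimal, hypothetical sketch (our illustration, not K-BERT's code) of the triple-injection step alone, under the simplifying assumption that a triple is appended directly after its subject token. K-BERT itself additionally uses soft-position embeddings and a visibility matrix so that the injected branch does not distort the original sentence; those are omitted here.

# Toy illustration of K-BERT-style triple injection (injection step only).
def inject_triples(tokens, kg):
    """Expand a token sequence with KG triples keyed by subject token."""
    augmented = []
    for tok in tokens:
        augmented.append(tok)
        for relation, obj in kg.get(tok, []):
            # Append the triple's relation and object as an in-sentence branch.
            augmented.extend([relation, obj])
    return augmented

kg = {"jihad": [("has_context", "religious_struggle")]}
print(inject_triples(["declare", "jihad", "online"], kg))
# ['declare', 'jihad', 'has_context', 'religious_struggle', 'online']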
Neural Attention Models (NAMs)

NAMs (Rush, Chopra, and Weston 2015) highlight particular features that are important for pattern recognition/classification based on a hierarchical architecture. The manipulation of attentional focus is effective in solving real-world problems involving massive amounts of data (Halevy, Norvig, and Pereira 2009; Sun et al. 2017). On the other hand, some applications demonstrate the limitation of attentional manipulation in a set of problems such as sentiment (mis)classification (Maurya 2018) and suicide risk (Corbitt-Hall et al. 2016), where feature presence is inherently ambiguous, just as in the online radicalization problem (Kursuncu et al. 2019a). For example, in the suicide risk prediction task, references to suicide-related terminology appear in the social media posts of both victims and supportive listeners, and existing NAMs fail to capture the semantic relations between terms that help differentiate a suicidal user from a supportive user (Gaur et al. 2019). To overcome such limitations in a sentiment classification task, (Vo et al. 2017) adds sentiment scores to the feature set to enhance the learned representation, and modifies the loss function to respond to values of the sentiment score during learning. However, (Sheth et al. 2017; Kho et al. 2019) have pointed out the importance of using domain-specific knowledge, especially in cases where the problem is complex in nature (Perera et al. 2016). (Bian, Gao, and Liu 2014) empirically demonstrated the effectiveness of combining richer semantics from domain knowledge with morphological and syntactic knowledge in the text, by modeling knowledge as an auxiliary task that regularizes the learning of the main objective in a deep neural network; a sketch of this auxiliary-task idea follows below.
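The following is a minimal numpy sketch (our illustration, not Bian et al.'s code) of knowledge as an auxiliary task: a shared representation feeds both a main classification head and a knowledge head, and the total loss couples the two objectives through an assumed trade-off weight.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # batch of document features (toy)
y_task = rng.integers(0, 2, size=8)   # main-task labels (e.g., sentiment)
y_know = rng.normal(size=(8, 4))      # knowledge targets (e.g., concept scores)

W_shared = rng.normal(scale=0.1, size=(16, 8))  # shared representation
W_main = rng.normal(scale=0.1, size=(8, 2))     # main-task head
W_aux = rng.normal(scale=0.1, size=(8, 4))      # knowledge (auxiliary) head

h = np.tanh(x @ W_shared)
logits = h @ W_main
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
main_loss = -np.log(probs[np.arange(8), y_task]).mean()  # cross-entropy
aux_loss = ((h @ W_aux - y_know) ** 2).mean()            # knowledge regression
loss = main_loss + 0.3 * aux_loss  # 0.3 is an assumed trade-off weight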
Knowledge-based Neural Networks

(Yi et al. 2018) introduced a knowledge-based recurrent attention neural network (KB-RANN) that modifies the attention mechanism by incorporating domain knowledge to improve model generalization. However, their domain knowledge is statistically derivable from the input data itself, which is analogous to merely learning an interpolation function over the existing data. (Dugas et al. 2009) proposed a modification of the neural network by adopting Lipschitz functions as its activation function. (Hu et al. 2016) proposed a combination of deep neural networks with logic rules by employing the knowledge distillation procedure (Hinton, Vinyals, and Dean 2015) of transferring the learned tacit knowledge from a larger neural network to the weights of a smaller neural network in data-limited settings. These studies for incorporating knowledge in a deep learning framework have not involved declarative knowledge structures in the form of KGs (e.g., DBpedia) (Chen et al. 2019). However, (Casteleiro et al. 2018) recently showed how the Cardiovascular Disease Ontology (CDO) provided context and reduced ambiguity, improving performance on a synonym detection task. (Shen et al. 2018) employed embeddings of entities in a KG, derived through Bi-LSTMs, to enhance the efficacy of NAMs. (Sarker et al. 2017) presented a conceptual framework for explaining artificial neural networks' classification behavior using background knowledge on the semantic web. (Makni and Hendler) described a deep learning approach to learn RDFS [2] rules from both synthetic and real-world semantic web data, and claim that their approach improves the noise tolerance of RDFS reasoning.

[2] https://www.w3.org/2001/sw/wiki/RDFS

All of the frameworks in the above subsections utilized external knowledge before or after the representation has been generated, rather than within the deep neural network as in our approach (Sheth et al. 2019). We propose a learning framework that infuses domain knowledge within the latent layers of neural networks for modeling.

Preliminaries

Symbolic representation of a domain, besides its probabilistic representation, is crucial for neuro-symbolic learning. In our approach, we propose to homogenize, within neural networks, symbolic information from KGs (see Section Knowledge Graphs) and contextual neural representations (see Section Contextual Modeling).

Knowledge Graphs

A knowledge graph (KG) is a conceptual model of a domain that stores and structures declarative knowledge in a human- and machine-readable format, constituting factual ground truth and embodying a domain ontology of objects, attributes, and relations. KGs rely on symbolic propositions, employing generic conceptual relationships in taxonomies and partonomies, and specific content with labeled links. Examples include DBpedia, UMLS, and ICD-10. The factual information about the domain is represented in the form of instances (or individuals) of those concepts (or classes) and relationships (Gruber 2008; Sheth and Thirunarayan 2012). Therefore, a domain can be described or modeled through KGs in a way that both computers and humans can understand. As KGs differentiate contextual nuances of concepts in the content, they play a key role in our framework, with extensive use by several functions.

Contextual Modeling

Capturing contextual cues in language is crucial in our approach; hence, we utilize NLMs to generate embeddings of the content. Recent embedding algorithms have emerged to create such representations, including Word2Vec (Goldberg and Levy 2014), GloVe (Pennington, Socher, and Manning 2014), FastText (Athiwaratkun, Wilson, and Anandkumar 2018) and BERT (Devlin et al. 2018).

Modeling context-sensitive problems in different domains (e.g., healthcare, cyber social threats, online extremism and harassment) depends heavily on carefully designed features that extract meaningful information, based on the characteristics of the problems and a ground truth dataset. Moreover, identifying these characteristics and differentiating the content requires different levels of granularity in the organization of features. For instance, in the problem of online Islamist extremism, the information shared in social media posts by users in extremism-related social networks displays an intent that depends on the user's type (e.g., recruiter, follower). Hence, as these user types show different characteristics (Kursuncu et al. 2018), for reliable analysis it is critical to consider different contextual dimensions (Kursuncu et al. 2019a; Kursuncu 2018). Moreover, the ambiguity of diagnostic terms (e.g., jihad) also mandates representation of terms in different contexts. Hence, to better reflect these differences, creating multiple models enables us to represent the multiple contextual dimensions for a reliable analysis. Figure 1 details the contextual dimension modeling workflow; a sketch of this workflow follows the figure caption below.

Figure 1: Contextual Dimension Modeling Diagram (Kursuncu et al. 2019a). The embedding algorithm above (W2V: Word2Vec) can be replaced by other algorithms such as BERT. For each dimension, a specific corpus is utilized to create the model, and the generated representations of the content are concatenated. Generating the three contextual dimension representations of a social media post will emphasize the weights of such essential lexical cues.
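The following is a minimal sketch (our illustration, with toy vectors) of the Figure 1 workflow: one embedding model per contextual dimension (religion, ideology, hate — each trained on its own corpus in the real workflow), with a post represented by concatenating its per-dimension vectors. The per-dimension width and the token lexicons are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(7)
DIM = 50  # assumed per-dimension embedding width

# Stand-ins for dimension-specific Word2Vec/BERT models: token -> vector.
dimension_models = {
    "religion": {w: rng.normal(size=DIM) for w in ["jihad", "faith", "prayer"]},
    "ideology": {w: rng.normal(size=DIM) for w in ["jihad", "caliphate", "state"]},
    "hate":     {w: rng.normal(size=DIM) for w in ["jihad", "kuffar", "enemy"]},
}

def embed(tokens, model):
    # Average the vectors of in-vocabulary tokens; zeros if none match.
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def contextual_representation(tokens):
    # Concatenate the three contextual dimension representations.
    return np.concatenate([embed(tokens, m) for m in dimension_models.values()])

post = "declare jihad against the enemy".split()
print(contextual_representation(post).shape)  # (150,)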
A Proposed Comprehensive Approach

Although existing research (Gaur et al. 2018; Bhatt et al. 2018a) shows the contribution of incorporating external knowledge in ML, this incorporation mostly takes place before or after the actual learning process (e.g., feature extraction, validation), thus remaining shallow. We believe that deep knowledge infusion, within the hidden layers of neural networks, will greatly improve performance by (i) reducing false alarms and information loss, (ii) boosting recall without sacrificing precision, (iii) providing finer-grained representations, (iv) enabling explainability (Islam et al. 2019; Kursuncu et al. 2019c), and (v) reducing bias. Specifically, we believe that it will become a critical and integral component of AI models integrated into deployed tools, e.g., in healthcare, where domain knowledge is crucial and indispensable in decision-making processes. Fortunately, these domains are rich in terms of their respective machine-readable knowledge resources, such as manually curated medical KGs (e.g., UMLS (McInnes, Pedersen, and Pakhomov 2009), ICD-10 (Brouch 2000) and DataMed (Ohno-Machado et al. 2017)). In our prior research (Gaur et al. 2018), we utilized ML models coupled with these KGs to predict mental health disorders among the 20 mental disorders (defined in the DSM-5) for Reddit posts. Typical approaches to such predictions employ word embeddings, such as Word2Vec, resulting in sub-optimal performance when used in domain-specific tasks. We incorporated knowledge into the embeddings of Reddit posts by (i) using zero-shot learning (Palatucci et al. 2009) and (ii) modulating (e.g., re-weighting) their embeddings, similar to NAMs, and obtained a significant reduction in the false alarm rate, from 13% (without knowledge) to 2.5% (with knowledge). In another study, we leveraged the domain knowledge in KGs to validate model weights that explain diverse crowd behavior among Fantasy Premier League (FPL) participants (Bhatt et al. 2018b). However, very little previous work has tried to integrate such functional knowledge into an existing deep learning framework.

We propose to further develop an innovative deep knowledge-infused learning approach that will reveal patterns missed by traditional approaches because of sparse feature occurrence, feature ambiguity and noise. This approach will support the following integrated aims: (i) Infusion of Declarative Domain Knowledge in a Deep Learning framework, and (ii) Optimal Sub-Knowledge Graph Creation and Evolution. The overall architecture in Figure 2 guides our proposed research on these two aims. Our methods will disambiguate important concepts defined in the respective KGs, with their different semantic meanings, through their structural relations. Knowledge infusion will redefine the emphasis of sparse-but-essential and irrelevant-but-frequently-occurring terms and concepts, boosting recall without reducing precision.

Figure 2: Overall Architecture: Contextual representations of data are generated, and domain knowledge amplifies the significance of specific important concepts that are missed in the learning model. Classification error determines the need for updating a Seeded SubKG with more relevant knowledge, resulting in a Seeded SubKG that is more refined and informative to our model.
Knowledge-Infused Learning

Each layer in a neural network architecture produces a latent representation of the input vector (h_t). The infusion of knowledge during the representation learning phase raises the following central research questions. R1: How do we decide whether or not to infuse knowledge at a particular stage while learning between layers, and how do we quantify the knowledge to be infused? R2: How do we merge latent representations between layers with external knowledge representations? R3: How do we propagate the knowledge through the learned latent representation? We propose to define two functions to address these questions: the Knowledge-Aware Loss Function (K-LF) and the Knowledge Modulation Function (K-MF).

Configurations of neural networks can be designed in various ways depending on the problem. As our aim is to infuse knowledge within the neural network, such an operation can take place (i) before the output layer (e.g., SoftMax), or (ii) between hidden layers (e.g., reinforcing the gates of an NLM layer, modulating the hidden states of NLM layers, or knowledge-driven NLM dropout and recurrent dropout between layers). To illustrate (i), we describe below our initial approach to neural language models that infuses knowledge before the output layer, which we believe will shed light on a path toward a reliable and robust solution, pending more research and rigorous experimentation; a structural sketch of option (i) follows.
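The following is a minimal PyTorch sketch (our illustration, under option (i) above): an LSTM encoder whose final hidden state is merged with a knowledge embedding just before the output (SoftMax) layer. The sigmoid-over-merge step mirrors line 7 of Algorithm 1 below; the dimensions, names, and the concatenation merge are assumptions for illustration, not the authors' exact architecture.

import torch
import torch.nn as nn

class KnowledgeInfusedClassifier(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128, k_dim=128, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        # Knowledge infusion step: merges h_T with knowledge embedding Ke.
        self.infuse = nn.Linear(hidden + k_dim, hidden)
        self.out = nn.Linear(hidden, classes)

    def forward(self, tokens, ke):
        _, (h_t, _) = self.lstm(self.embed(tokens))
        h_t = h_t[-1]  # last layer's hidden state, h_T
        # sigma(W_hk * (h_T (+) Ke) + b_hk), with (+) read as concatenation here.
        merged = torch.sigmoid(self.infuse(torch.cat([h_t, ke], dim=-1)))
        return torch.log_softmax(self.out(merged), dim=-1)

model = KnowledgeInfusedClassifier()
tokens = torch.randint(0, 1000, (4, 12))  # toy batch of token ids
ke = torch.randn(4, 128)                  # knowledge embeddings (assumed given)
print(model(tokens, ke).shape)            # torch.Size([4, 2])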
Seeded Sub-Knowledge Graph

The Seeded Sub-Knowledge Graph, a subset of KGs, participates broadly in our technical approach. Generic KGs (e.g., DBpedia (Bizer et al. 2009), YAGO2 (Hoffart et al. 2013), Freebase (Bollacker et al. 2008)) may contain over a million entities and close to a billion relationships. Using the entire graph of linked data on the web can cause (1) unnecessary computation and (2) noise due to irrelevant knowledge, and has sometimes failed to benefit intelligent applications (Roy, Park, and Pan 2017). However, real-world problems are domain-specific and require only a relevant (sub)portion of the full graph. Creation of a Seeded Sub-KG (Lalithsena 2018) based on a ground truth dataset is needed to represent a particular domain, using information-theoretic approaches (e.g., KL divergence) and probabilistic soft logic (Kimmig et al. 2012). Further, a sub-graph discovery approach (Cameron et al. 2015; Lalithsena 2018) can also be used, utilizing probabilistic graphical models (e.g., deep belief networks, conditional random fields). In our approach, the Seeded SubKG will be updated with more knowledge based on the difference between the learned representation and the relevant knowledge representation from the KG (see Section Differential Knowledge Engine).

Ke: Knowledge Embedding Creation

The representation of knowledge in the Seeded SubKG will be generated as embedding vectors. Specific contextual dimension models and/or more generic models can be utilized to create an embedding of each concept and its relations in the Seeded SubKG. Unlike traditional approaches that compute the representation of each concept in the KGs by simply taking an average of the embedding vectors of concepts, we leverage the existing structural information of the graph. This procedure is formally defined as:

K_e = Σ_{ij} [C_i, C_j] ⊗ D_{ij}    (1)

where K_e is the representation of the concepts enriched by the relationships in the Seeded-KG, (C_i, C_j) is a relevant pair of concepts in the Seeded-KG, and D_{ij} is a distance measure (e.g., Least Common Subsumer (Baader, Sertkaya, and Turhan 2007)) between the two concepts C_i and C_j. Novel methods will be further examined, building upon this initial approach as well as existing tools, including TransE (Bordes et al. 2013), TransH (Wang et al. 2014), and HolE (Nickel et al. 2016), for the creation of embeddings from KGs.
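The following is a minimal sketch of one possible reading of Equation 1 (our illustration, with assumed choices): each related concept pair contributes its concept vectors weighted by a graph distance term. Inverse shortest-path length stands in for the Least Common Subsumer measure, and additive weighting stands in for the combination operator; neither choice is prescribed by the paper.

import numpy as np
import networkx as nx
from itertools import combinations

rng = np.random.default_rng(1)
DIM = 50

# Toy Seeded SubKG and per-concept embeddings from a contextual dimension model.
subkg = nx.Graph([("jihad", "struggle"), ("jihad", "extremism"),
                  ("extremism", "recruitment")])
concept_vec = {c: rng.normal(size=DIM) for c in subkg.nodes}

def knowledge_embedding(g, vecs):
    ke = np.zeros(DIM)
    for ci, cj in combinations(g.nodes, 2):
        d = nx.shortest_path_length(g, ci, cj)   # structural distance D_ij
        ke += (vecs[ci] + vecs[cj]) / (1.0 + d)  # nearer pairs weigh more
    return ke / np.linalg.norm(ke)

ke = knowledge_embedding(subkg, concept_vec)
print(ke.shape)  # (50,)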
Knowledge Infusion Layer

In a many-to-one NLM (Shivakumar et al. 2018) network with T hidden layers, the T-th layer contains the learned representation before the output layer. The output layer (e.g., SoftMax) of the NLM estimates the error to be back-propagated. While techniques for knowledge infusion between hidden layers and just before the output layer will both be explored, in this subsection we explain the Knowledge Infusion Layer (K-IL), which takes place just before the output layer.

To illustrate an initial approach, in Figure 3 we use LSTMs as the NLMs in our neural network. K-IL adds an additional layer before the output layer of our proposed neural network architecture. This layer takes as input the latent vector of the penultimate layer (h_{T-1}), the latent vector of the last hidden layer (h_T), and the knowledge embedding (K_e). In this layer, we define two particular functions that are critical for merging the latent vectors from the hidden layers with the knowledge embedding vector from the KG. Note that the dimensions of these vectors are the same because they are created from the same models (e.g., contextual models), which makes the merge operation of those vectors possible and valid.

Figure 3: Inner Mechanism of the Knowledge Infusion Layer

Algorithm 1: Routine for Infusion of Knowledge in NLMs
1:  procedure KnowledgeInfusion
2:      Data: NLM_type, #Epochs, #Iter, K_e
3:      Output: M_T
4:      for ne = 1 to #Epochs do
5:          h_T, h_{T-1} <- TrainingNLM(NLM_type, #Iter)
6:          while D_KL(h_{T-1} || K_e) - D_KL(h_T || K_e) > ε do
7:              h_T <- σ(W_hk * (h_T ⊕ K_e) + b_hk)
8:              W_hk <- W_hk - η_k * ∇(K-LF)
9:              M_T <- h_T ⊙ W_hk
10:     return M_T

Algorithm 1 takes the type of neural language model, the number of epochs and iterations, and the seeded knowledge graph embedding K_e as input, and returns a knowledge-infused representation of the hidden state, M_T. In line 4, the infusion of knowledge takes place after each epoch without obstructing the learning of the vanilla NLM model, as detailed in lines 5-10. Within the knowledge infusion process (lines 7-9), we optimize the loss function in Equation 2, with the convergence condition defined as the reduction in the difference between the D_KL of h_T and of h_{T-1} in the presence of K_e. Considering the vanilla structure of an NLM (Greff et al. 2017), M_T is utilized by the fully connected layer for classification.
K-LF: Knowledge-Aware Loss Function

In neural networks, hidden layers may de-emphasize important patterns due to the sparsity of certain features during learning, which causes information loss. In some cases, such patterns may not even appear in the data. However, such relations or patterns may be defined in KGs, together with even more relevant knowledge. We call this information gap between the learned representation of the data and the knowledge representation the differential knowledge. Information loss in a learning process is relative to the distribution that suffered the loss. Hence, we propose a measure to determine the differential knowledge and to guide the degree of knowledge infusion in learning. As our initial approach to this measure, we developed a two-state regularized loss function utilizing Kullback-Leibler (KL) divergence. Our choice of the KL divergence measure is largely influenced by the Markov assumptions made in language modeling, as highlighted in (Longworth 2010). The K-LF measure estimates the divergence between the hidden representations (h_{T-1}, h_T) and the knowledge representation (K_e), to determine the differential knowledge to be infused. Formally, we define it as arg min(h_{T-1}, h_T, K_e) ≡ K-LF, where h_{T-1} serves as an input for the convergence constraint:

K-LF = min D_KL(h_T || K_e)   s.t.   D_KL(h_T || K_e) < D_KL(h_{T-1} || K_e)    (2)

We minimize the relative entropy for information loss to maximize the information gain from the knowledge representation (e.g., K_e). We will compute the differential knowledge (∇K-LF) through this optimization; thus, the computed differential knowledge will also determine the degree of knowledge to be infused. ∇K-LF will be computed in the form of embedding vectors, and the dimensions of K_e will be preserved.

K-MF: Knowledge Modulation Function

We need to merge the differential knowledge representation with the partially learned representation. However, this operation cannot be done arbitrarily, as the vector spaces of the two representations, if not the same, differ in both dimension and distribution (Dumančić and Blockeel 2017). We describe an initial approach for the K-MF that modulates the learned weight matrix of the neural network with the hidden vector through an appropriate operation (e.g., Hadamard pointwise multiplication). The weight update at the T-th layer can be formulated as:

W_hk = W_hk - η_k * ∇K-LF

where W_hk is the learned weight matrix infusing knowledge, η_k is the learning momentum (Sutskever et al. 2013), and ∇K-LF is the differential knowledge. The weight matrix W_hk is computed through the learning epochs utilizing the differential knowledge embedding (∇K-LF). We then merge W_hk with the hidden vector h_T through the K-MF. Considering that we use Hadamard pointwise multiplication as our initial approach, we formally define the output M_T of the K-MF at the T-th layer as:

M_T = h_T ⊙ W_hk    (3)

where M_T is the knowledge-modulated representation, h_T is the hidden vector, and W_hk is the learned weight matrix infusing knowledge. Further investigation of techniques for the K-MF constitutes a central research topic for the research community.
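The following is a minimal numpy sketch of Algorithm 1's inner loop under the definitions above (our reading, with assumed details): hidden states and K_e are softmax-normalized so that D_KL applies, the ∇K-LF gradient is approximated by the probability gap between softmax(h_T) and softmax(K_e), ⊕ is read as elementwise addition, W_hk is kept vector-shaped so the Hadamard K-MF of Equation 3 is well defined, and the previous iterate plays the role of h_{T-1} in the convergence test.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def d_kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(3)
DIM, eta_k, eps = 50, 0.05, 1e-4
h_prev, h_t = rng.normal(size=DIM), rng.normal(size=DIM)  # h_{T-1}, h_T
ke = rng.normal(size=DIM)                                  # knowledge embedding
w_hk, b_hk = np.ones(DIM), np.zeros(DIM)                   # infusion parameters

while d_kl(softmax(h_prev), softmax(ke)) - d_kl(softmax(h_t), softmax(ke)) > eps:
    h_prev = h_t
    h_t = 1 / (1 + np.exp(-(w_hk * (h_t + ke) + b_hk)))  # line 7: σ(W(h ⊕ Ke)+b)
    grad_klf = softmax(h_t) - softmax(ke)                # assumed ∇K-LF surrogate
    w_hk = w_hk - eta_k * grad_klf                       # line 8: K-LF step
m_t = h_t * w_hk                                         # line 9: K-MF (Eq. 3)
print(m_t.shape)  # (50,)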
Differential Knowledge Engine

In deep neural networks, each epoch generates an error that is back-propagated until the model reaches a saddle point in the local minima, with the error reduced in each epoch. The error indicates the difference between the probabilities of the actual and predicted labels, and this difference can be used to enrich the Seeded SubKG in our proposed knowledge-infused learning (K-IL) framework. In this subsection, we discuss the sub-knowledge-graph operations that are based on the difference between the learned representation of our knowledge-infused model (M_T) and the representation of the relevant sub-knowledge graph from the KG, which we name the differential sub-knowledge graph. We define a Knowledge Proximity function to generate the differential sub-knowledge graph, and an Update Seeded SubKG function to insert the differential sub-knowledge graph into the Seeded SubKG.

Knowledge Proximity: Upon the arrival of the learned representation from the knowledge-infused learning model, we query the KG to retrieve information related to the respective data point. In this step, it is important to find the optimal proximity between the concept and its related concepts. For example, from the "South Carolina" concept, we may traverse the surrounding concepts with a varying number of hops (empirically decided; see the sketch below). Finding the optimal number of hops in each direction from the concept in question is still an open research question. Once we find the optimal proximity of a particular concept in the KG, we propagate through the KG based on that proximity, starting from the concept in question.
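The following is a minimal networkx sketch (our illustration, with a toy graph) of hop-bounded knowledge proximity: starting from a seed concept, keep the subgraph within a given number of hops, where the radius is the empirically decided hop count from the text.

import networkx as nx

kg = nx.Graph([("South Carolina", "Columbia"), ("Columbia", "USC"),
               ("South Carolina", "USA"), ("USA", "North America")])

def proximity_subgraph(g, seed, hops=2):
    # ego_graph keeps all nodes within `hops` of the seed, plus their edges.
    return nx.ego_graph(g, seed, radius=hops)

print(sorted(proximity_subgraph(kg, "South Carolina", hops=1).nodes))
# ['Columbia', 'South Carolina', 'USA']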
learning in a hybrid neuro-symbolic framework will greatly Specifically, it will have a potentially significant impact on contribute to fulfilling AI’s promise. Acknowledgement Chen, B.; Hao, Z.; Cai, X.; Cai, R.; Wen, W.; Zhu, J.; and We acknowledge partial support from the National Science Xie, G. 2019. Embedding logic rules into recurrent neural Foundation (NSF) award CNS-1513721: “Context-Aware networks. IEEE Access 7:14938–14946. Harassment Detection on Social Media". Any opinions, con- Cheng, J. 2018. Ai reasoning systems: Pac and applied clusions or recommendations expressed in this material are methods. arXiv preprint arXiv:1807.05054. those of the authors and do not necessarily reflect the views Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; of the NSF. Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statis- References tical machine translation. arXiv preprint arXiv:1406.1078. Corbitt-Hall, D. J.; Gauthier, J. M.; Davis, M. T.; and Witte, Athiwaratkun, B.; Wilson, A. G.; and Anandkumar, A. T. K. 2016. College students’ responses to suicidal content 2018. Probabilistic fasttext for multi-sense word embed- on social networking sites: an examination using a simulated dings. arXiv preprint arXiv:1806.02901. facebook newsfeed. Suicide and Life-Threatening Behavior Baader, F.; Sertkaya, B.; and Turhan, A.-Y. 2007. Comput- 46(5):609–624. ing the least common subsumer wrt a background terminol- De Palma, G.; Kiani, B.; and Lloyd, S. 2019. Random deep ogy. Journal of Applied Logic 5(3):392–420. neural networks are biased towards simple functions. In Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V.; Sheth, A.; Advances in Neural Information Processing Systems, 1962– and Minnery, B. 2018a. Enhancing crowd wisdom using 1974. explainable diversity inferred from social media. In 2018 Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. IEEE/WIC/ACM International Conference on Web Intelli- Bert: Pre-training of deep bidirectional transformers for lan- gence (WI), 293–300. IEEE. guage understanding. arXiv preprint arXiv:1810.04805. Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V. L.; Sheth, A. P.; Dugas, C.; Bengio, Y.; Bélisle, F.; Nadeau, C.; and Gar- and Minnery, B. 2018b. Enhancing crowd wisdom us- cia, R. 2009. Incorporating functional knowledge in ing explainable diversity inferred from social media. In neural networks. Journal of Machine Learning Research IEEE/WIC/ACM International Conference on Web Intelli- 10(Jun):1239–1262. gence. Santiago, Chile: IEEE. Dumančić, S., and Blockeel, H. 2017. Demystifying rela- Bian, J.; Gao, B.; and Liu, T.-Y. 2014. Knowledge-powered tional latent representations. In International Conference on deep learning for word embedding. In Joint European con- Inductive Logic Programming, 63–77. Springer. ference on machine learning and knowledge discovery in Gaur, M.; Kursuncu, U.; Alambo, A.; Sheth, A.; Daniu- databases, 132–148. Springer. laityte, R.; Thirunarayan, K.; and Pathak, J. 2018. " let me Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; tell you about your mental health!" contextualized classifi- Cyganiak, R.; and Hellmann, S. 2009. Dbpedia-a crystal- cation of reddit posts to dsm-5 for web-based intervention. lization point for the web of data. Web Semantics: science, Gaur, M.; Alambo, A.; Sain, J. P.; Kursuncu, U.; services and agents on the world wide web 7(3):154–165. 
Applications for K-IL

Artificial intelligence models will be widely deployed in real-world decision-making processes in the foreseeable future, once the challenges described in the Introduction are overcome. As we argue that the incorporation of external structured knowledge will address these challenges, it will benefit various application domains, such as the social and health sciences, automating processes that require knowledge and intelligence. Specifically, it will have a potentially significant impact on the predictive analysis of online communications, such as misinformation and extremism, conversational modeling, and disease prediction.

As predicting online extremism is challenging and false alarms create serious implications potentially affecting millions of individuals, (Kursuncu et al. 2019a) showed that the (shallow) infusion of external domain-specific knowledge improves precision, reducing potential social discrimination. Further, in the prediction of mental health diseases defined in the DSM-5, shallow knowledge infusion reduced false alarms by 30% (Gaur et al. 2018). Conversational models pose another important application area: (Liu et al. 2019b) proposed a conversation framework in which the fusion of KGs and text mutually reinforce each other to generate knowledge-aware responses, improving the model's generalizability and explainability. In another study, (Young et al. 2018) integrated commonsense knowledge into conversational models for selecting the most appropriate response. While machine learning finds many application areas in medicine for disease prediction, large data is not always available; in this case, knowledge-infused learning generates more representative features, thereby avoiding overfitting. A study (Tan et al. 2019) on the early diagnosis of lung cancer using computed tomography images infused knowledge, in the form of expert-curated features, into the learning process through a CNN. Despite the small data set, the enriched feature space in their knowledge-infused learning process improved the sensitivity and specificity of the model.

In contrast to the applications above, we believe that the deep infusion of external knowledge within latent layers will enhance the coverage of the information being learned by the model based on KGs. Hence, this will provide better generalizability, reduction in bias and false alarms, disambiguation, less reliance on large data, explainability, reliability and robustness to real-world applications in the critical domains mentioned above, with significant impact.

Conclusion

Combining deep learning and knowledge graphs in a hybrid neuro-symbolic learning framework will further enhance performance and accelerate the convergence of the learning processes. Specifically, the impact of this improvement in very sensitive domains, such as health and social science, will be significant with respect to their implications for real-world deployment. Adoption of tools that automate tasks that require knowledge and intelligence, and that are traditionally done by humans, will improve with the help of this framework that marries deep learning and knowledge graph techniques. Specifically, we envision that the infusion of knowledge as described in this framework will capture information for the corresponding domain at a finer granularity of abstraction. We believe that this approach will provide reliable solutions to the problems faced in deep learning, as described in the Introduction and the Applications section. Hence, in real-world applications, resolving these issues with both knowledge graphs and deep learning in a hybrid neuro-symbolic framework will greatly contribute to fulfilling AI's promise.

Acknowledgement

We acknowledge partial support from the National Science Foundation (NSF) award CNS-1513721: "Context-Aware Harassment Detection on Social Media". Any opinions, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

Athiwaratkun, B.; Wilson, A. G.; and Anandkumar, A. 2018. Probabilistic FastText for multi-sense word embeddings. arXiv preprint arXiv:1806.02901.
Baader, F.; Sertkaya, B.; and Turhan, A.-Y. 2007. Computing the least common subsumer w.r.t. a background terminology. Journal of Applied Logic 5(3):392-420.
Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V.; Sheth, A.; and Minnery, B. 2018a. Enhancing crowd wisdom using explainable diversity inferred from social media. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 293-300. IEEE.
Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V. L.; Sheth, A. P.; and Minnery, B. 2018b. Enhancing crowd wisdom using explainable diversity inferred from social media. In IEEE/WIC/ACM International Conference on Web Intelligence. Santiago, Chile: IEEE.
Bian, J.; Gao, B.; and Liu, T.-Y. 2014. Knowledge-powered deep learning for word embedding. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 132-148. Springer.
Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; Cyganiak, R.; and Hellmann, S. 2009. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web 7(3):154-165.
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 1247-1250. ACM.
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, 2787-2795.
Brouch, K. L. 2000. Where in the world is ICD-10? AHIMA, American Health Information Management Association.
Cameron, D.; Kavuluru, R.; Rindflesch, T. C.; Sheth, A. P.; Thirunarayan, K.; and Bodenreider, O. 2015. Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics 54:141-157.
Casteleiro, M. A.; Demetriou, G.; Read, W.; Prieto, M. J. F.; Maroto, N.; Fernandez, D. M.; Nenadic, G.; Klein, J.; Keane, J.; and Stevens, R. 2018. Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature. Journal of Biomedical Semantics 9(1):13.
Chen, B.; Hao, Z.; Cai, X.; Cai, R.; Wen, W.; Zhu, J.; and Xie, G. 2019. Embedding logic rules into recurrent neural networks. IEEE Access 7:14938-14946.
Cheng, J. 2018. AI reasoning systems: PAC and applied methods. arXiv preprint arXiv:1807.05054.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Corbitt-Hall, D. J.; Gauthier, J. M.; Davis, M. T.; and Witte, T. K. 2016. College students' responses to suicidal content on social networking sites: an examination using a simulated Facebook newsfeed. Suicide and Life-Threatening Behavior 46(5):609-624.
De Palma, G.; Kiani, B.; and Lloyd, S. 2019. Random deep neural networks are biased towards simple functions. In Advances in Neural Information Processing Systems, 1962-1974.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dugas, C.; Bengio, Y.; Bélisle, F.; Nadeau, C.; and Garcia, R. 2009. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research 10(Jun):1239-1262.
Dumančić, S., and Blockeel, H. 2017. Demystifying relational latent representations. In International Conference on Inductive Logic Programming, 63-77. Springer.
Gaur, M.; Kursuncu, U.; Alambo, A.; Sheth, A.; Daniulaityte, R.; Thirunarayan, K.; and Pathak, J. 2018. "Let me tell you about your mental health!" Contextualized classification of Reddit posts to DSM-5 for web-based intervention.
Gaur, M.; Alambo, A.; Sain, J. P.; Kursuncu, U.; Thirunarayan, K.; Kavuluru, R.; Sheth, A.; Welton, R.; and Pathak, J. 2019. Knowledge-aware assessment of severity of suicide risk for early intervention. In The World Wide Web Conference, 514-525.
Goldberg, Y., and Levy, O. 2014. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Greff, K.; Srivastava, R. K.; Koutník, J.; Steunebrink, B. R.; and Schmidhuber, J. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28(10):2222-2232.
Gruber, T. 2008. Ontology. In Encyclopedia of Database Systems, Ling Liu and M. Tamer Özsu (Eds.).
Halevy, A.; Norvig, P.; and Pereira, F. 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2):8-12.
Hamaguchi, T.; Oiwa, H.; Shimbo, M.; and Matsumoto, Y. 2017. Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. arXiv preprint arXiv:1706.05674.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.
Hoffart, J.; Suchanek, F. M.; Berberich, K.; and Weikum, G. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194:28-61.
Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.
Islam, S. R.; Eberle, W.; Bundy, S.; and Ghafoor, S. K. 2019. Infusing domain knowledge in AI-based "black box" models for better explainability with application in bankruptcy prediction. arXiv preprint arXiv:1905.11474.
Karpathy, A. 2015. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog 21.
Kho, S. J.; Padhee, S.; Bajaj, G.; Thirunarayan, K.; and Sheth, A. 2019. Domain-specific use cases for knowledge-enabled social media analysis. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining. Springer. 233-246.
Kimmig, A.; Bach, S.; Broecheler, M.; Huang, B.; and Getoor, L. 2012. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, 1-4.
Kursuncu, U.; Gaur, M.; Lokala, U.; Illendula, A.; Thirunarayan, K.; Daniulaityte, R.; Sheth, A.; and Arpinar, I. B. 2018. "What's ur type?" Contextualized classification of user types in marijuana-related communications using compositional multiview embedding. In IEEE/WIC/ACM International Conference on Web Intelligence (WI'18).
Kursuncu, U.; Gaur, M.; Castillo, C.; Alambo, A.; Thirunarayan, K.; Shalin, V.; Achilov, D.; Arpinar, I. B.; and Sheth, A. 2019a. Modeling Islamist extremist communications on social media using contextual dimensions: religion, ideology, and hate. Proceedings of the ACM on Human-Computer Interaction 3(CSCW):151.
Kursuncu, U.; Gaur, M.; Lokala, U.; Thirunarayan, K.; Sheth, A.; and Arpinar, I. B. 2019b. Predictive analysis on Twitter: techniques and applications. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining. Springer. 67-104.
Kursuncu, U.; Gaur, M.; Thirunarayan, K.; and Sheth, A. 2019c. Explainability of medical AI through domain knowledge. Ontology Summit 2019, Medical Explanation.
Kursuncu, U. 2018. Modeling the Persona in Persuasive Discourse on Social Media Using Context-aware and Knowledge-driven Learning. Ph.D. Dissertation, University of Georgia.
Lalithsena, S. 2018. Domain-specific knowledge extraction from the web of data. Ph.D. Dissertation, Wright State University.
Liu, H., and Singh, P. 2004. Commonsense reasoning in and over natural language. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 293-306. Springer.
Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Ju, Q.; Deng, H.; and Wang, P. 2019a. K-BERT: Enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606.
Liu, Z.; Niu, Z.-Y.; Wu, H.; and Wang, H. 2019b. Knowledge aware conversation generation with explainable reasoning over augmented graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1782-1792.
Liu, M.; Zhang, D.; and Chen, S. 2014. Attribute relation learning for zero-shot classification. Neurocomputing 139:34-46.
Longworth, C. 2010. Kernel methods for text-independent speaker verification. Ph.D. Dissertation, University of Cambridge.
Makni, B., and Hendler, J. Deep learning for noise-tolerant RDFS reasoning.
Masse, N. Y.; Grant, G. D.; and Freedman, D. J. 2018. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences 115(44):E10467-E10475.
Maurya, A. K. 2018. Learning low dimensional word based linear classifiers using data shared adaptive bootstrap aggregated lasso with application to IMDB data. arXiv preprint arXiv:1807.10623.
McInnes, B. T.; Pedersen, T.; and Pakhomov, S. V. 2009. UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. In AMIA Annual Symposium Proceedings, volume 2009, 431. American Medical Informatics Association.
Nickel, M.; Rosasco, L.; Poggio, T. A.; et al. 2016. Holographic embeddings of knowledge graphs. In AAAI, volume 2, 3-2.
Ohno-Machado, L.; Sansone, S.-A.; Alter, G.; Fore, I.; Grethe, J.; Xu, H.; Gonzalez-Beltran, A.; Rocca-Serra, P.; Gururaj, A. E.; Bell, E.; et al. 2017. Finding useful data across multiple biomedical data repositories using DataMed. Nature Genetics 49(6):816.
Olteanu, A.; Castillo, C.; Diaz, F.; and Kiciman, E. 2019. Social data: biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2:13.
Palatucci, M.; Pomerleau, D.; Hinton, G. E.; and Mitchell, T. M. 2009. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, 1410-1418.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
Perera, S.; Mendes, P. N.; Alex, A.; Sheth, A. P.; and Thirunarayan, K. 2016. Implicit entity linking in tweets. In International Semantic Web Conference, 118-132. Springer.
Roy, A.; Park, Y.; and Pan, S. 2017. Learning domain-specific word embeddings from sparse cybersecurity texts. arXiv preprint arXiv:1709.07470.
Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5):206-215.
Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
Sarker, M. K.; Xie, N.; Doran, D.; Raymer, M.; and Hitzler, P. 2017. Explaining trained neural networks with semantic web technologies: first steps. arXiv preprint arXiv:1710.04324.
Scarlini, B.; Pasini, T.; and Navigli, R. 2020. SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. In Proc. of AAAI.
Shen, Y.; Deng, Y.; Yang, M.; Li, Y.; Du, N.; Fan, W.; and Lei, K. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 901-904. ACM.
Sheth, A., and Kapanipathi, P. 2016. Semantic filtering for social data. IEEE Internet Computing 20(4):74-78.
Sheth, A., and Thirunarayan, K. 2012. Semantics empowered Web 3.0: managing enterprise, social, sensor, and cloud-based data and services for advanced applications. Synthesis Lectures on Data Management 4(6):1-175.
Sheth, A.; Perera, S.; Wijeratne, S.; and Thirunarayan, K. 2017. Knowledge will propel machine understanding of content: extrapolating from current examples. arXiv preprint arXiv:1707.05308.
Sheth, A.; Gaur, M.; Kursuncu, U.; and Wickramarachchi, R. 2019. Shades of knowledge-infused learning for enhancing deep learning. IEEE Internet Computing 23(6):54-63.
Shivakumar, P. G.; Li, H.; Knight, K.; and Georgiou, P. 2018. Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. arXiv preprint arXiv:1802.02607.
Sun, C.; Shrivastava, A.; Singh, S.; and Gupta, A. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, 843-852. IEEE.
Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; and Wu, H. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 1139-1147.
Tan, J.; Huo, Y.; Liang, Z.; and Li, L. 2019. Expert knowledge-infused deep learning for automatic lung nodule detection. Journal of X-ray Science and Technology 27(1):17-35.
Topol, E. J. 2019. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25(1):44-56.
Valiant, L. G. 2000. Robust logics. Artificial Intelligence 117(2):231-253.
Vo, K.; Pham, D.; Nguyen, M.; Mai, T.; and Quan, T. 2017. Combination of domain knowledge and deep learning for sentiment analysis. In International Workshop on Multi-disciplinary Trends in Artificial Intelligence, 162-173. Springer.
Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, 1112-1119.
Yi, K.; Jian, Z.; Chen, S.; Chen, Y.; and Zheng, N. 2018. Knowledge-based recurrent attentive neural network for traffic sign detection. arXiv preprint arXiv:1803.05263.
Young, T.; Cambria, E.; Chaturvedi, I.; Zhou, H.; Biswas, S.; and Huang, M. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.