Knowledge Infused Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learning

Ugur Kursuncu*, Manas Gaur*, Amit Sheth
AI Institute, University of South Carolina, Columbia, SC, USA
{kursuncu@mailbox.sc.edu, mgaur@email.sc.edu, amit@sc.edu}

* Equally contributed. Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Learning the underlying patterns in data goes beyond instance-based generalization to external knowledge represented in structured graphs or networks. Deep learning, which primarily constitutes the neural computing stream in AI, has shown significant advances in probabilistically learning latent patterns using a multi-layered network of computational nodes (i.e., neurons/hidden units). Structured knowledge, which underlies symbolic computing approaches and often supports reasoning, has also seen significant growth in recent years, in the form of broad-based (e.g., DBpedia, YAGO) and domain-, industry- or application-specific knowledge graphs. A common substrate with careful integration of the two will raise opportunities to develop neuro-symbolic learning approaches for AI, where conceptual and probabilistic representations are combined. As the incorporation of external knowledge will aid in supervising the learning of features for the model, deep infusion of representational knowledge from knowledge graphs within hidden layers will further enhance the learning process. Although much work remains, we believe that knowledge graphs will play an increasing role in developing hybrid neuro-symbolic intelligent systems (bottom-up deep learning with top-down symbolic computing) as well as in building explainable AI systems, for which knowledge graphs will provide scaffolding for punctuating neural computing. In this position paper, we describe our motivation for such a neuro-symbolic approach and a framework that combines knowledge graphs and neural networks.

Introduction

Data-driven bottom-up machine/deep learning (ML) and top-down knowledge-driven approaches to creating reliable models have shown remarkable success in specific areas, such as search, speech recognition, language translation, computer vision, and autonomous vehicles. On the other hand, they have had limited success in understanding and deciphering contextual information, such as the detection of abstract concepts in online/offline human interactions. Current challenges in the translation of research methods and resources into practice often draw from a class of rarely studied problems that do not yield to contemporary bottom-up ML methods. Policymakers and practitioners assert serious usability concerns that constrain adoption, notably in high-consequence domains (Topol 2019). In most cases, data-dependent ML algorithms require high computing power and large datasets, where the crucial signals may still be sparse or ambiguous, threatening precision (Cheng 2018). Moreover, ML models that are deployed in the absence of transparency and accountability (Rudin 2019) and trained on biased datasets can lead to grave consequences, such as potential social discrimination and unfair treatment (Olteanu et al. 2019). Further, the potentially severe implications of false alarms in an ML-integrated real-world application may affect millions of people (Kursuncu et al. 2019a; Kursuncu 2018).
The fundamental challenges are common to a majority of problems in a variety of domains with real-world impact. Specifically, these challenges are: (1) dependency on the large datasets required by bottom-up, data-dependent ML algorithms (Valiant 2000; De Palma, Kiani, and Lloyd 2019); (2) bias in the dataset, enabling the model to potentially cause social discrimination and unfair treatment; (3) multidimensionality, ambiguity and sparsity, as the data involves unconstrained concepts and relationships with meaning drawn from different contextual dimensions of the content, such as religion, history and politics (Kursuncu et al. 2019a; Kursuncu 2018) — further, the limited number of labeled instances available for training may fail to represent the true nature of concepts and relationships in data sets, leading to ambiguous or sparse true signals; (4) the lack of information traceability for model explainability; (5) the coverage of information specific to a domain that would be missed otherwise; (6) the complexity of the model architecture in time and space [1]; and (7) false alarms in model performance. Consequently, we believe standard, separate knowledge graph (KG) and ML methods are vulnerable to deducing or learning spurious concepts and relationships that appear deceptively good on a KG or training dataset, yet do not provide adequate results when the data set contains contextual and dynamically changing concepts and relations.

[1] https://www.theguardian.com/commentisfree/2019/nov/16/can-planet-afford-exorbitant-power-demands-of-machine-learning

In this position paper, we describe innovations that will operationalize more abstract models built upon the characteristics of a domain to render them computationally accessible within neural network architectures. We propose a neuro-symbolic method, knowledge-infused learning, that measures the information loss in latent features learned by neural networks through KGs with conceptual and structural relationship information, to address the aforementioned challenges. The infusion of knowledge during the representation learning phase raises the following central research questions: (i) How do we decide whether or not to infuse knowledge at a particular stage while learning between layers, and how do we quantify the knowledge to be infused? (ii) How do we merge latent representations between layers with external knowledge representations? (iii) How do we propagate the knowledge through the learned latent representation? Considering the future deployment of AI in applications, the potential impact of this approach is significant.

As stated in (Karpathy 2015), the deeper the network, the denser the representation and the better the learning. A large number of parameters and the layered nature of neural networks make them modifiable based on specific problem characteristics. However, challenges (1), (3), (5) and (7) make neural networks vulnerable to the sudden appearance of relevant-but-sparse or ambiguous features in often noisy big data (Valiant 2000; De Palma, Kiani, and Lloyd 2019; Kursuncu et al. 2019b). On the other hand, KG-based approaches structure search within a feature space defined by domain experts. To compensate for these vulnerabilities, incorporating knowledge into the learned representation in a principled fashion is required. A promising approach is to base this on a measurable discrepancy between the knowledge captured in the neural network and external resources.

Computational modeling coupled with knowledge infusion in a neural network will disambiguate important concepts defined in a KG, with their different semantic meanings, through the KG's structural relations. Knowledge infusion will redefine the emphasis of sparse-but-essential and irrelevant-but-frequently-occurring terms and concepts, boosting recall without reducing precision. Further, it will provide explanatory insight into the model, robustness to noise, and reduced dependency on frequency in the learning process. This neuro-symbolic learning approach will potentially transform existing methods for data analysis and for building computational models. While the impact of this approach is transferable (and replicable) to a majority of domains, the explicit implications are particularly apparent for the social science (Kursuncu et al. 2019a) and healthcare (Gaur et al. 2018) domains.
Related Work

As the incorporation of knowledge has been explored in various forms in prior research, in this section we describe the methodologies and applications specifically related to knowledge-infused learning: neural language models, neural attention models, and knowledge-based neural networks, all of which utilize external knowledge before/after the representation has been generated.

Neural Language Models (NLMs)

NLMs are a category of neural networks capable of learning sequential dependencies in a sentence, and they preserve such information while learning a representation. In particular, LSTM (Long Short-Term Memory) networks (Hochreiter and Schmidhuber 1997) emerged from the failure of RNNs (Recurrent Neural Networks) to remember long-term information. Concerning the loss of contextual information while learning, (Cho et al. 2014) proposed a context-feedforward LSTM architecture in which the context learned by the previous layer is merged with the forgetting and modulation gates of the next layer. However, if erroneous contextual information is learned in previous layers, it is difficult to correct (Masse, Grant, and Freedman 2018), a problem magnified by noisy data and content sparsity (e.g., Twitter, Reddit, blogs).

As the inclusion of structured knowledge (e.g., knowledge graphs) in deep learning improves information retrieval (Sheth and Kapanipathi 2016), prior research has shown the significance of knowledge in the pursuit of improving NLMs, such as in commonsense reasoning (Liu and Singh 2004). Transformer NLMs such as BERT (Devlin et al. 2018), including its variants BioBERT and SciBERT, are still data dependent. BERT has been utilized in hybrid frameworks such as (Scarlini, Pasini, and Navigli 2020) for the creation of sense embeddings using BabelNet and NASARI. (Liu et al. 2019a) proposed K-BERT, which enriches the representations by injecting triples from KGs into the sentence; a toy sketch of this injection idea follows below. As this incorporation of knowledge for BERT takes place in the form of attention, we consider K-BERT to be semi-deep infusion (Sheth et al. 2019). Similarly, ERNIE (Sun et al. 2019) incorporated external knowledge to capture lexical, syntactic, and semantic information, enriching BERT.
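To make the injection idea concrete, the following is a minimal, hypothetical sketch (our illustration, not K-BERT's code) of the triple-injection step alone, under the simplifying assumption that a triple is appended directly after its subject token. K-BERT itself additionally uses soft-position embeddings and a visibility matrix so that the injected branch does not distort the original sentence; those are omitted here.

# Toy illustration of K-BERT-style triple injection (injection step only).
def inject_triples(tokens, kg):
    """Expand a token sequence with KG triples keyed by subject token."""
    augmented = []
    for tok in tokens:
        augmented.append(tok)
        for relation, obj in kg.get(tok, []):
            # Append the triple's relation and object as an in-sentence branch.
            augmented.extend([relation, obj])
    return augmented

kg = {"jihad": [("has_context", "religious_struggle")]}
print(inject_triples(["declare", "jihad", "online"], kg))
# ['declare', 'jihad', 'has_context', 'religious_struggle', 'online']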
Neural Attention Models (NAMs)

NAMs (Rush, Chopra, and Weston 2015) highlight particular features that are important for pattern recognition/classification based on a hierarchical architecture. The manipulation of attentional focus is effective in solving real-world problems involving massive amounts of data (Halevy, Norvig, and Pereira 2009; Sun et al. 2017). On the other hand, some applications demonstrate the limitation of attentional manipulation in a set of problems such as sentiment (mis)classification (Maurya 2018) and suicide risk (Corbitt-Hall et al. 2016), where feature presence is inherently ambiguous, just as in the online radicalization problem (Kursuncu et al. 2019a). For example, in the suicide risk prediction task, references to suicide-related terminology appear in the social media posts of both victims and supportive listeners, and existing NAMs fail to capture the semantic relations between terms that help differentiate a suicidal user from a supportive user (Gaur et al. 2019). To overcome such limitations in a sentiment classification task, (Vo et al. 2017) adds sentiment scores to the feature set to enhance the learned representation, and modifies the loss function to respond to values of the sentiment score during learning. However, (Sheth et al. 2017; Kho et al. 2019) have pointed out the importance of using domain-specific knowledge, especially in cases where the problem is complex in nature (Perera et al. 2016). (Bian, Gao, and Liu 2014) empirically demonstrated the effectiveness of combining richer semantics from domain knowledge with morphological and syntactic knowledge in the text, by modeling knowledge as an auxiliary task that regularizes the learning of the main objective in a deep neural network; a sketch of this auxiliary-task idea follows below.
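The following is a minimal numpy sketch (our illustration, not Bian et al.'s code) of knowledge as an auxiliary task: a shared representation feeds both a main classification head and a knowledge head, and the total loss couples the two objectives through an assumed trade-off weight.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # batch of document features (toy)
y_task = rng.integers(0, 2, size=8)   # main-task labels (e.g., sentiment)
y_know = rng.normal(size=(8, 4))      # knowledge targets (e.g., concept scores)

W_shared = rng.normal(scale=0.1, size=(16, 8))  # shared representation
W_main = rng.normal(scale=0.1, size=(8, 2))     # main-task head
W_aux = rng.normal(scale=0.1, size=(8, 4))      # knowledge (auxiliary) head

h = np.tanh(x @ W_shared)
logits = h @ W_main
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
main_loss = -np.log(probs[np.arange(8), y_task]).mean()  # cross-entropy
aux_loss = ((h @ W_aux - y_know) ** 2).mean()            # knowledge regression
loss = main_loss + 0.3 * aux_loss  # 0.3 is an assumed trade-off weight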
Knowledge-based Neural Networks

(Yi et al. 2018) introduced a knowledge-based recurrent attention neural network (KB-RANN) that modifies the attention mechanism by incorporating domain knowledge to improve model generalization. However, their domain knowledge is statistically derivable from the input data itself, which is analogous to merely learning an interpolation function over the existing data. (Dugas et al. 2009) proposed a modification of the neural network by adopting Lipschitz functions as its activation function. (Hu et al. 2016) proposed a combination of deep neural networks with logic rules by employing the knowledge distillation procedure (Hinton, Vinyals, and Dean 2015) of transferring the learned tacit knowledge from a larger neural network to the weights of a smaller neural network in data-limited settings. These studies for incorporating knowledge in a deep learning framework have not involved declarative knowledge structures in the form of KGs (e.g., DBpedia) (Chen et al. 2019). However, (Casteleiro et al. 2018) recently showed how the Cardiovascular Disease Ontology (CDO) provided context and reduced ambiguity, improving performance on a synonym detection task. (Shen et al. 2018) employed embeddings of entities in a KG, derived through Bi-LSTMs, to enhance the efficacy of NAMs. (Sarker et al. 2017) presented a conceptual framework for explaining artificial neural networks' classification behavior using background knowledge on the semantic web. (Makni and Hendler) described a deep learning approach to learn RDFS [2] rules from both synthetic and real-world semantic web data, and claim that their approach improves the noise tolerance of RDFS reasoning.

[2] https://www.w3.org/2001/sw/wiki/RDFS

All of the frameworks in the above subsections utilized external knowledge before or after the representation has been generated, rather than within the deep neural network as in our approach (Sheth et al. 2019). We propose a learning framework that infuses domain knowledge within the latent layers of neural networks for modeling.

Preliminaries

Symbolic representation of a domain, besides its probabilistic representation, is crucial for neuro-symbolic learning. In our approach, we propose to homogenize, within neural networks, symbolic information from KGs (see Section Knowledge Graphs) and contextual neural representations (see Section Contextual Modeling).

Knowledge Graphs

A knowledge graph (KG) is a conceptual model of a domain that stores and structures declarative knowledge in a human- and machine-readable format, constituting factual ground truth and embodying a domain ontology of objects, attributes, and relations. KGs rely on symbolic propositions, employing generic conceptual relationships in taxonomies and partonomies, and specific content with labeled links. Examples include DBpedia, UMLS, and ICD-10. The factual information about the domain is represented in the form of instances (or individuals) of those concepts (or classes) and relationships (Gruber 2008; Sheth and Thirunarayan 2012). Therefore, a domain can be described or modeled through KGs in a way that both computers and humans can understand. As KGs differentiate contextual nuances of concepts in the content, they play a key role in our framework, with extensive use by several functions.

Contextual Modeling

Capturing contextual cues in language is crucial in our approach; hence, we utilize NLMs to generate embeddings of the content. Recent embedding algorithms have emerged to create such representations, including Word2Vec (Goldberg and Levy 2014), GloVe (Pennington, Socher, and Manning 2014), FastText (Athiwaratkun, Wilson, and Anandkumar 2018) and BERT (Devlin et al. 2018).

Modeling context-sensitive problems in different domains (e.g., healthcare, cyber social threats, online extremism and harassment) depends heavily on carefully designed features that extract meaningful information, based on the characteristics of the problems and a ground truth dataset. Moreover, identifying these characteristics and differentiating the content requires different levels of granularity in the organization of features. For instance, in the problem of online Islamist extremism, the information shared in social media posts by users in extremism-related social networks displays an intent that depends on the user's type (e.g., recruiter, follower). Hence, as these user types show different characteristics (Kursuncu et al. 2018), for reliable analysis it is critical to consider different contextual dimensions (Kursuncu et al. 2019a; Kursuncu 2018). Moreover, the ambiguity of diagnostic terms (e.g., jihad) also mandates representation of terms in different contexts. Hence, to better reflect these differences, creating multiple models enables us to represent the multiple contextual dimensions for a reliable analysis. Figure 1 details the contextual dimension modeling workflow; a sketch of this workflow follows the figure caption below.

Figure 1: Contextual Dimension Modeling Diagram (Kursuncu et al. 2019a). The embedding algorithm above (W2V: Word2Vec) can be replaced by other algorithms such as BERT. For each dimension, a specific corpus is utilized to create the model, and the generated representations of the content are concatenated. Generating the three contextual dimension representations of a social media post will emphasize the weights of such essential lexical cues.
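The following is a minimal sketch (our illustration, with toy vectors) of the Figure 1 workflow: one embedding model per contextual dimension (religion, ideology, hate — each trained on its own corpus in the real workflow), with a post represented by concatenating its per-dimension vectors. The per-dimension width and the token lexicons are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(7)
DIM = 50  # assumed per-dimension embedding width

# Stand-ins for dimension-specific Word2Vec/BERT models: token -> vector.
dimension_models = {
    "religion": {w: rng.normal(size=DIM) for w in ["jihad", "faith", "prayer"]},
    "ideology": {w: rng.normal(size=DIM) for w in ["jihad", "caliphate", "state"]},
    "hate":     {w: rng.normal(size=DIM) for w in ["jihad", "kuffar", "enemy"]},
}

def embed(tokens, model):
    # Average the vectors of in-vocabulary tokens; zeros if none match.
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def contextual_representation(tokens):
    # Concatenate the three contextual dimension representations.
    return np.concatenate([embed(tokens, m) for m in dimension_models.values()])

post = "declare jihad against the enemy".split()
print(contextual_representation(post).shape)  # (150,)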
A Proposed Comprehensive Approach

Although existing research (Gaur et al. 2018; Bhatt et al. 2018a) shows the contribution of incorporating external knowledge in ML, this incorporation mostly takes place before or after the actual learning process (e.g., feature extraction, validation), thus remaining shallow. We believe that deep knowledge infusion, within the hidden layers of neural networks, will greatly improve performance by (i) reducing false alarms and information loss, (ii) boosting recall without sacrificing precision, (iii) providing finer-grained representations, (iv) enabling explainability (Islam et al. 2019; Kursuncu et al. 2019c), and (v) reducing bias. Specifically, we believe that it will become a critical and integral component of AI models integrated into deployed tools, e.g., in healthcare, where domain knowledge is crucial and indispensable in decision-making processes. Fortunately, these domains are rich in terms of their respective machine-readable knowledge resources, such as manually curated medical KGs (e.g., UMLS (McInnes, Pedersen, and Pakhomov 2009), ICD-10 (Brouch 2000) and DataMed (Ohno-Machado et al. 2017)). In our prior research (Gaur et al. 2018), we utilized ML models coupled with these KGs to predict mental health disorders among the 20 mental disorders (defined in the DSM-5) for Reddit posts. Typical approaches to such predictions employ word embeddings, such as Word2Vec, resulting in sub-optimal performance when used in domain-specific tasks. We incorporated knowledge into the embeddings of Reddit posts by (i) using zero-shot learning (Palatucci et al. 2009) and (ii) modulating (e.g., re-weighting) their embeddings, similar to NAMs, and obtained a significant reduction in the false alarm rate, from 13% (without knowledge) to 2.5% (with knowledge). In another study, we leveraged the domain knowledge in KGs to validate model weights that explain diverse crowd behavior among Fantasy Premier League (FPL) participants (Bhatt et al. 2018b). However, very little previous work has tried to integrate such functional knowledge into an existing deep learning framework.

We propose to further develop an innovative deep knowledge-infused learning approach that will reveal patterns missed by traditional approaches because of sparse feature occurrence, feature ambiguity and noise. This approach will support the following integrated aims: (i) Infusion of Declarative Domain Knowledge in a Deep Learning framework, and (ii) Optimal Sub-Knowledge Graph Creation and Evolution. The overall architecture in Figure 2 guides our proposed research on these two aims. Our methods will disambiguate important concepts defined in the respective KGs, with their different semantic meanings, through their structural relations. Knowledge infusion will redefine the emphasis of sparse-but-essential and irrelevant-but-frequently-occurring terms and concepts, boosting recall without reducing precision.

Figure 2: Overall Architecture: Contextual representations of data are generated, and domain knowledge amplifies the significance of specific important concepts that are missed in the learning model. Classification error determines the need for updating a Seeded SubKG with more relevant knowledge, resulting in a Seeded SubKG that is more refined and informative to our model.
Knowledge-Infused Learning

Each layer in a neural network architecture produces a latent representation of the input vector (h_t). The infusion of knowledge during the representation learning phase raises the following central research questions. R1: How do we decide whether or not to infuse knowledge at a particular stage while learning between layers, and how do we quantify the knowledge to be infused? R2: How do we merge latent representations between layers with external knowledge representations? R3: How do we propagate the knowledge through the learned latent representation? We propose to define two functions to address these questions: the Knowledge-Aware Loss Function (K-LF) and the Knowledge Modulation Function (K-MF).

Configurations of neural networks can be designed in various ways depending on the problem. As our aim is to infuse knowledge within the neural network, such an operation can take place (i) before the output layer (e.g., SoftMax), or (ii) between hidden layers (e.g., reinforcing the gates of an NLM layer, modulating the hidden states of NLM layers, or knowledge-driven NLM dropout and recurrent dropout between layers). To illustrate (i), we describe below our initial approach to neural language models that infuses knowledge before the output layer, which we believe will shed light on a path toward a reliable and robust solution, pending more research and rigorous experimentation; a structural sketch of option (i) follows.
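The following is a minimal PyTorch sketch (our illustration, under option (i) above): an LSTM encoder whose final hidden state is merged with a knowledge embedding just before the output (SoftMax) layer. The sigmoid-over-merge step mirrors line 7 of Algorithm 1 below; the dimensions, names, and the concatenation merge are assumptions for illustration, not the authors' exact architecture.

import torch
import torch.nn as nn

class KnowledgeInfusedClassifier(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128, k_dim=128, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        # Knowledge infusion step: merges h_T with knowledge embedding Ke.
        self.infuse = nn.Linear(hidden + k_dim, hidden)
        self.out = nn.Linear(hidden, classes)

    def forward(self, tokens, ke):
        _, (h_t, _) = self.lstm(self.embed(tokens))
        h_t = h_t[-1]  # last layer's hidden state, h_T
        # sigma(W_hk * (h_T (+) Ke) + b_hk), with (+) read as concatenation here.
        merged = torch.sigmoid(self.infuse(torch.cat([h_t, ke], dim=-1)))
        return torch.log_softmax(self.out(merged), dim=-1)

model = KnowledgeInfusedClassifier()
tokens = torch.randint(0, 1000, (4, 12))  # toy batch of token ids
ke = torch.randn(4, 128)                  # knowledge embeddings (assumed given)
print(model(tokens, ke).shape)            # torch.Size([4, 2])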
Seeded Sub-Knowledge Graph

The Seeded Sub-Knowledge Graph, a subset of KGs, participates broadly in our technical approach. Generic KGs (e.g., DBpedia (Bizer et al. 2009), YAGO2 (Hoffart et al. 2013), Freebase (Bollacker et al. 2008)) may contain over a million entities and close to a billion relationships. Using the entire graph of linked data on the web can cause (1) unnecessary computation and (2) noise due to irrelevant knowledge, and has sometimes failed to benefit intelligent applications (Roy, Park, and Pan 2017). However, real-world problems are domain-specific and require only a relevant (sub)portion of the full graph. Creation of a Seeded Sub-KG (Lalithsena 2018) based on a ground truth dataset is needed to represent a particular domain, using information-theoretic approaches (e.g., KL divergence) and probabilistic soft logic (Kimmig et al. 2012). Further, a sub-graph discovery approach (Cameron et al. 2015; Lalithsena 2018) can also be used, utilizing probabilistic graphical models (e.g., deep belief networks, conditional random fields). In our approach, the Seeded SubKG will be updated with more knowledge based on the difference between the learned representation and the relevant knowledge representation from the KG (see Section Differential Knowledge Engine).

Ke: Knowledge Embedding Creation

The representation of knowledge in the Seeded SubKG will be generated as embedding vectors. Specific contextual dimension models and/or more generic models can be utilized to create an embedding of each concept and its relations in the Seeded SubKG. Unlike traditional approaches that compute the representation of each concept in the KGs by simply taking an average of the embedding vectors of concepts, we leverage the existing structural information of the graph. This procedure is formally defined as:

K_e = Σ_{ij} [C_i, C_j] ⊗ D_{ij}    (1)

where K_e is the representation of the concepts enriched by the relationships in the Seeded-KG, (C_i, C_j) is a relevant pair of concepts in the Seeded-KG, and D_{ij} is a distance measure (e.g., Least Common Subsumer (Baader, Sertkaya, and Turhan 2007)) between the two concepts C_i and C_j. Novel methods will be further examined, building upon this initial approach as well as existing tools, including TransE (Bordes et al. 2013), TransH (Wang et al. 2014), and HolE (Nickel et al. 2016), for the creation of embeddings from KGs.
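The following is a minimal sketch of one possible reading of Equation 1 (our illustration, with assumed choices): each related concept pair contributes its concept vectors weighted by a graph distance term. Inverse shortest-path length stands in for the Least Common Subsumer measure, and additive weighting stands in for the combination operator; neither choice is prescribed by the paper.

import numpy as np
import networkx as nx
from itertools import combinations

rng = np.random.default_rng(1)
DIM = 50

# Toy Seeded SubKG and per-concept embeddings from a contextual dimension model.
subkg = nx.Graph([("jihad", "struggle"), ("jihad", "extremism"),
                  ("extremism", "recruitment")])
concept_vec = {c: rng.normal(size=DIM) for c in subkg.nodes}

def knowledge_embedding(g, vecs):
    ke = np.zeros(DIM)
    for ci, cj in combinations(g.nodes, 2):
        d = nx.shortest_path_length(g, ci, cj)   # structural distance D_ij
        ke += (vecs[ci] + vecs[cj]) / (1.0 + d)  # nearer pairs weigh more
    return ke / np.linalg.norm(ke)

ke = knowledge_embedding(subkg, concept_vec)
print(ke.shape)  # (50,)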
Knowledge Infusion Layer

In a many-to-one NLM (Shivakumar et al. 2018) network with T hidden layers, the T-th layer contains the learned representation before the output layer. The output layer (e.g., SoftMax) of the NLM estimates the error to be back-propagated. While techniques for knowledge infusion between hidden layers and just before the output layer will both be explored, in this subsection we explain the Knowledge Infusion Layer (K-IL), which takes place just before the output layer.

To illustrate an initial approach, in Figure 3 we use LSTMs as the NLMs in our neural network. K-IL adds an additional layer before the output layer of our proposed neural network architecture. This layer takes as input the latent vector of the penultimate layer (h_{T-1}), the latent vector of the last hidden layer (h_T), and the knowledge embedding (K_e). In this layer, we define two particular functions that are critical for merging the latent vectors from the hidden layers with the knowledge embedding vector from the KG. Note that the dimensions of these vectors are the same because they are created from the same models (e.g., contextual models), which makes the merge operation of those vectors possible and valid.

Figure 3: Inner Mechanism of the Knowledge Infusion Layer

Algorithm 1: Routine for Infusion of Knowledge in NLMs
1:  procedure KnowledgeInfusion
2:      Data: NLM_type, #Epochs, #Iter, K_e
3:      Output: M_T
4:      for ne = 1 to #Epochs do
5:          h_T, h_{T-1} <- TrainingNLM(NLM_type, #Iter)
6:          while D_KL(h_{T-1} || K_e) - D_KL(h_T || K_e) > ε do
7:              h_T <- σ(W_hk * (h_T ⊕ K_e) + b_hk)
8:              W_hk <- W_hk - η_k * ∇(K-LF)
9:              M_T <- h_T ⊙ W_hk
10:     return M_T

Algorithm 1 takes the type of neural language model, the number of epochs and iterations, and the seeded knowledge graph embedding K_e as input, and returns a knowledge-infused representation of the hidden state, M_T. In line 4, the infusion of knowledge takes place after each epoch without obstructing the learning of the vanilla NLM model, as detailed in lines 5-10. Within the knowledge infusion process (lines 7-9), we optimize the loss function in Equation 2, with the convergence condition defined as the reduction in the difference between the D_KL of h_T and of h_{T-1} in the presence of K_e. Considering the vanilla structure of an NLM (Greff et al. 2017), M_T is utilized by the fully connected layer for classification.
K-LF: Knowledge-Aware Loss Function

In neural networks, hidden layers may de-emphasize important patterns due to the sparsity of certain features during learning, which causes information loss. In some cases, such patterns may not even appear in the data. However, such relations or patterns may be defined in KGs, together with even more relevant knowledge. We call this information gap between the learned representation of the data and the knowledge representation the differential knowledge. Information loss in a learning process is relative to the distribution that suffered the loss. Hence, we propose a measure to determine the differential knowledge and to guide the degree of knowledge infusion in learning. As our initial approach to this measure, we developed a two-state regularized loss function utilizing Kullback-Leibler (KL) divergence. Our choice of the KL divergence measure is largely influenced by the Markov assumptions made in language modeling, as highlighted in (Longworth 2010). The K-LF measure estimates the divergence between the hidden representations (h_{T-1}, h_T) and the knowledge representation (K_e), to determine the differential knowledge to be infused. Formally, we define it as arg min(h_{T-1}, h_T, K_e) ≡ K-LF, where h_{T-1} serves as an input for the convergence constraint:

K-LF = min D_KL(h_T || K_e)   s.t.   D_KL(h_T || K_e) < D_KL(h_{T-1} || K_e)    (2)

We minimize the relative entropy for information loss to maximize the information gain from the knowledge representation (e.g., K_e). We will compute the differential knowledge (∇K-LF) through this optimization; thus, the computed differential knowledge will also determine the degree of knowledge to be infused. ∇K-LF will be computed in the form of embedding vectors, and the dimensions of K_e will be preserved.

K-MF: Knowledge Modulation Function

We need to merge the differential knowledge representation with the partially learned representation. However, this operation cannot be done arbitrarily, as the vector spaces of the two representations, if not the same, differ in both dimension and distribution (Dumančić and Blockeel 2017). We describe an initial approach for the K-MF that modulates the learned weight matrix of the neural network with the hidden vector through an appropriate operation (e.g., Hadamard pointwise multiplication). The weight update at the T-th layer can be formulated as:

W_hk = W_hk - η_k * ∇K-LF

where W_hk is the learned weight matrix infusing knowledge, η_k is the learning momentum (Sutskever et al. 2013), and ∇K-LF is the differential knowledge. The weight matrix W_hk is computed through the learning epochs utilizing the differential knowledge embedding (∇K-LF). We then merge W_hk with the hidden vector h_T through the K-MF. Considering that we use Hadamard pointwise multiplication as our initial approach, we formally define the output M_T of the K-MF at the T-th layer as:

M_T = h_T ⊙ W_hk    (3)

where M_T is the knowledge-modulated representation, h_T is the hidden vector, and W_hk is the learned weight matrix infusing knowledge. Further investigation of techniques for the K-MF constitutes a central research topic for the research community.
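The following is a minimal numpy sketch of Algorithm 1's inner loop under the definitions above (our reading, with assumed details): hidden states and K_e are softmax-normalized so that D_KL applies, the ∇K-LF gradient is approximated by the probability gap between softmax(h_T) and softmax(K_e), ⊕ is read as elementwise addition, W_hk is kept vector-shaped so the Hadamard K-MF of Equation 3 is well defined, and the previous iterate plays the role of h_{T-1} in the convergence test.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def d_kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(3)
DIM, eta_k, eps = 50, 0.05, 1e-4
h_prev, h_t = rng.normal(size=DIM), rng.normal(size=DIM)  # h_{T-1}, h_T
ke = rng.normal(size=DIM)                                  # knowledge embedding
w_hk, b_hk = np.ones(DIM), np.zeros(DIM)                   # infusion parameters

while d_kl(softmax(h_prev), softmax(ke)) - d_kl(softmax(h_t), softmax(ke)) > eps:
    h_prev = h_t
    h_t = 1 / (1 + np.exp(-(w_hk * (h_t + ke) + b_hk)))  # line 7: σ(W(h ⊕ Ke)+b)
    grad_klf = softmax(h_t) - softmax(ke)                # assumed ∇K-LF surrogate
    w_hk = w_hk - eta_k * grad_klf                       # line 8: K-LF step
m_t = h_t * w_hk                                         # line 9: K-MF (Eq. 3)
print(m_t.shape)  # (50,)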
Differential Knowledge Engine

In deep neural networks, each epoch generates an error that is back-propagated until the model reaches a saddle point in the local minima, with the error reduced in each epoch. The error indicates the difference between the probabilities of the actual and predicted labels, and this difference can be used to enrich the Seeded SubKG in our proposed knowledge-infused learning (K-IL) framework. In this subsection, we discuss the sub-knowledge-graph operations that are based on the difference between the learned representation of our knowledge-infused model (M_T) and the representation of the relevant sub-knowledge graph from the KG, which we name the differential sub-knowledge graph. We define a Knowledge Proximity function to generate the differential sub-knowledge graph, and an Update Seeded SubKG function to insert the differential sub-knowledge graph into the Seeded SubKG.

Knowledge Proximity: Upon the arrival of the learned representation from the knowledge-infused learning model, we query the KG to retrieve information related to the respective data point. In this step, it is important to find the optimal proximity between the concept and its related concepts. For example, from the "South Carolina" concept, we may traverse the surrounding concepts with a varying number of hops (empirically decided; see the sketch below). Finding the optimal number of hops in each direction from the concept in question is still an open research question. Once we find the optimal proximity of a particular concept in the KG, we propagate through the KG based on that proximity, starting from the concept in question.
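The following is a minimal networkx sketch (our illustration, with a toy graph) of hop-bounded knowledge proximity: starting from a seed concept, keep the subgraph within a given number of hops, where the radius is the empirically decided hop count from the text.

import networkx as nx

kg = nx.Graph([("South Carolina", "Columbia"), ("Columbia", "USC"),
               ("South Carolina", "USA"), ("USA", "North America")])

def proximity_subgraph(g, seed, hops=2):
    # ego_graph keeps all nodes within `hops` of the seed, plus their edges.
    return nx.ego_graph(g, seed, radius=hops)

print(sorted(proximity_subgraph(kg, "South Carolina", hops=1).nodes))
# ['Columbia', 'South Carolina', 'USA']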
learning in a hybrid neuro-symbolic framework will greatly Specifically, it will have a potentially significant impact on contribute to fulfilling AI’s promise. Acknowledgement Chen, B.; Hao, Z.; Cai, X.; Cai, R.; Wen, W.; Zhu, J.; and We acknowledge partial support from the National Science Xie, G. 2019. Embedding logic rules into recurrent neural Foundation (NSF) award CNS-1513721: “Context-Aware networks. IEEE Access 7:14938–14946. Harassment Detection on Social Media". Any opinions, con- Cheng, J. 2018. Ai reasoning systems: Pac and applied clusions or recommendations expressed in this material are methods. arXiv preprint arXiv:1807.05054. those of the authors and do not necessarily reflect the views Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; of the NSF. Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statis- References tical machine translation. arXiv preprint arXiv:1406.1078. Corbitt-Hall, D. J.; Gauthier, J. M.; Davis, M. T.; and Witte, Athiwaratkun, B.; Wilson, A. G.; and Anandkumar, A. T. K. 2016. College students’ responses to suicidal content 2018. Probabilistic fasttext for multi-sense word embed- on social networking sites: an examination using a simulated dings. arXiv preprint arXiv:1806.02901. facebook newsfeed. Suicide and Life-Threatening Behavior Baader, F.; Sertkaya, B.; and Turhan, A.-Y. 2007. Comput- 46(5):609–624. ing the least common subsumer wrt a background terminol- De Palma, G.; Kiani, B.; and Lloyd, S. 2019. Random deep ogy. Journal of Applied Logic 5(3):392–420. neural networks are biased towards simple functions. In Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V.; Sheth, A.; Advances in Neural Information Processing Systems, 1962– and Minnery, B. 2018a. Enhancing crowd wisdom using 1974. explainable diversity inferred from social media. In 2018 Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. IEEE/WIC/ACM International Conference on Web Intelli- Bert: Pre-training of deep bidirectional transformers for lan- gence (WI), 293–300. IEEE. guage understanding. arXiv preprint arXiv:1810.04805. Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V. L.; Sheth, A. P.; Dugas, C.; Bengio, Y.; Bélisle, F.; Nadeau, C.; and Gar- and Minnery, B. 2018b. Enhancing crowd wisdom us- cia, R. 2009. Incorporating functional knowledge in ing explainable diversity inferred from social media. In neural networks. Journal of Machine Learning Research IEEE/WIC/ACM International Conference on Web Intelli- 10(Jun):1239–1262. gence. Santiago, Chile: IEEE. Dumančić, S., and Blockeel, H. 2017. Demystifying rela- Bian, J.; Gao, B.; and Liu, T.-Y. 2014. Knowledge-powered tional latent representations. In International Conference on deep learning for word embedding. In Joint European con- Inductive Logic Programming, 63–77. Springer. ference on machine learning and knowledge discovery in Gaur, M.; Kursuncu, U.; Alambo, A.; Sheth, A.; Daniu- databases, 132–148. Springer. laityte, R.; Thirunarayan, K.; and Pathak, J. 2018. " let me Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; tell you about your mental health!" contextualized classifi- Cyganiak, R.; and Hellmann, S. 2009. Dbpedia-a crystal- cation of reddit posts to dsm-5 for web-based intervention. lization point for the web of data. Web Semantics: science, Gaur, M.; Alambo, A.; Sain, J. P.; Kursuncu, U.; services and agents on the world wide web 7(3):154–165. 
Applications for K-IL

Artificial intelligence models will be widely deployed in real-world decision-making processes in the foreseeable future, once the challenges described in the Introduction are overcome. As we argue that the incorporation of external structured knowledge will address these challenges, it will benefit various application domains, such as the social and health sciences, automating processes that require knowledge and intelligence. Specifically, it will have a potentially significant impact on the predictive analysis of online communications, such as misinformation and extremism, conversational modeling, and disease prediction.

As predicting online extremism is challenging and false alarms create serious implications potentially affecting millions of individuals, (Kursuncu et al. 2019a) showed that the (shallow) infusion of external domain-specific knowledge improves precision, reducing potential social discrimination. Further, in the prediction of mental health diseases defined in the DSM-5, shallow knowledge infusion reduced false alarms by 30% (Gaur et al. 2018). Conversational models pose another important application area: (Liu et al. 2019b) proposed a conversation framework in which the fusion of KGs and text mutually reinforce each other to generate knowledge-aware responses, improving the model's generalizability and explainability. In another study, (Young et al. 2018) integrated commonsense knowledge into conversational models for selecting the most appropriate response. While machine learning finds many application areas in medicine for disease prediction, large data is not always available; in this case, knowledge-infused learning generates more representative features, thereby avoiding overfitting. A study (Tan et al. 2019) on the early diagnosis of lung cancer using computed tomography images infused knowledge, in the form of expert-curated features, into the learning process through a CNN. Despite the small data set, the enriched feature space in their knowledge-infused learning process improved the sensitivity and specificity of the model.

In contrast to the applications above, we believe that the deep infusion of external knowledge within latent layers will enhance the coverage of the information being learned by the model based on KGs. Hence, this will provide better generalizability, reduction in bias and false alarms, disambiguation, less reliance on large data, explainability, reliability and robustness to real-world applications in the critical domains mentioned above, with significant impact.

Conclusion

Combining deep learning and knowledge graphs in a hybrid neuro-symbolic learning framework will further enhance performance and accelerate the convergence of the learning processes. Specifically, the impact of this improvement in very sensitive domains, such as health and social science, will be significant with respect to their implications for real-world deployment. Adoption of tools that automate tasks that require knowledge and intelligence, and that are traditionally done by humans, will improve with the help of this framework that marries deep learning and knowledge graph techniques. Specifically, we envision that the infusion of knowledge as described in this framework will capture information for the corresponding domain at a finer granularity of abstraction. We believe that this approach will provide reliable solutions to the problems faced in deep learning, as described in the Introduction and the Applications section. Hence, in real-world applications, resolving these issues with both knowledge graphs and deep learning in a hybrid neuro-symbolic framework will greatly contribute to fulfilling AI's promise.

Acknowledgement

We acknowledge partial support from the National Science Foundation (NSF) award CNS-1513721: "Context-Aware Harassment Detection on Social Media". Any opinions, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

Athiwaratkun, B.; Wilson, A. G.; and Anandkumar, A. 2018. Probabilistic FastText for multi-sense word embeddings. arXiv preprint arXiv:1806.02901.
Baader, F.; Sertkaya, B.; and Turhan, A.-Y. 2007. Computing the least common subsumer w.r.t. a background terminology. Journal of Applied Logic 5(3):392-420.
Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V.; Sheth, A.; and Minnery, B. 2018a. Enhancing crowd wisdom using explainable diversity inferred from social media. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 293-300. IEEE.
Bhatt, S.; Gaur, M.; Bullemer, B.; Shalin, V. L.; Sheth, A. P.; and Minnery, B. 2018b. Enhancing crowd wisdom using explainable diversity inferred from social media. In IEEE/WIC/ACM International Conference on Web Intelligence. Santiago, Chile: IEEE.
Bian, J.; Gao, B.; and Liu, T.-Y. 2014. Knowledge-powered deep learning for word embedding. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 132-148. Springer.
Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; Cyganiak, R.; and Hellmann, S. 2009. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web 7(3):154-165.
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 1247-1250. ACM.
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, 2787-2795.
Brouch, K. L. 2000. Where in the world is ICD-10? AHIMA, American Health Information Management Association.
Cameron, D.; Kavuluru, R.; Rindflesch, T. C.; Sheth, A. P.; Thirunarayan, K.; and Bodenreider, O. 2015. Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics 54:141-157.
Casteleiro, M. A.; Demetriou, G.; Read, W.; Prieto, M. J. F.; Maroto, N.; Fernandez, D. M.; Nenadic, G.; Klein, J.; Keane, J.; and Stevens, R. 2018. Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature. Journal of Biomedical Semantics 9(1):13.
Chen, B.; Hao, Z.; Cai, X.; Cai, R.; Wen, W.; Zhu, J.; and Xie, G. 2019. Embedding logic rules into recurrent neural networks. IEEE Access 7:14938-14946.
Cheng, J. 2018. AI reasoning systems: PAC and applied methods. arXiv preprint arXiv:1807.05054.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Corbitt-Hall, D. J.; Gauthier, J. M.; Davis, M. T.; and Witte, T. K. 2016. College students' responses to suicidal content on social networking sites: an examination using a simulated Facebook newsfeed. Suicide and Life-Threatening Behavior 46(5):609-624.
De Palma, G.; Kiani, B.; and Lloyd, S. 2019. Random deep neural networks are biased towards simple functions. In Advances in Neural Information Processing Systems, 1962-1974.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dugas, C.; Bengio, Y.; Bélisle, F.; Nadeau, C.; and Garcia, R. 2009. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research 10(Jun):1239-1262.
Dumančić, S., and Blockeel, H. 2017. Demystifying relational latent representations. In International Conference on Inductive Logic Programming, 63-77. Springer.
Gaur, M.; Kursuncu, U.; Alambo, A.; Sheth, A.; Daniulaityte, R.; Thirunarayan, K.; and Pathak, J. 2018. "Let me tell you about your mental health!" Contextualized classification of Reddit posts to DSM-5 for web-based intervention.
Gaur, M.; Alambo, A.; Sain, J. P.; Kursuncu, U.; Thirunarayan, K.; Kavuluru, R.; Sheth, A.; Welton, R.; and Pathak, J. 2019. Knowledge-aware assessment of severity of suicide risk for early intervention. In The World Wide Web Conference, 514-525.
Goldberg, Y., and Levy, O. 2014. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Greff, K.; Srivastava, R. K.; Koutník, J.; Steunebrink, B. R.; and Schmidhuber, J. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28(10):2222-2232.
Gruber, T. 2008. Ontology. In Encyclopedia of Database Systems, Ling Liu and M. Tamer Özsu (Eds.).
Halevy, A.; Norvig, P.; and Pereira, F. 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2):8-12.
Hamaguchi, T.; Oiwa, H.; Shimbo, M.; and Matsumoto, Y. 2017. Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. arXiv preprint arXiv:1706.05674.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.
Hoffart, J.; Suchanek, F. M.; Berberich, K.; and Weikum, G. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194:28-61.
Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.
Islam, S. R.; Eberle, W.; Bundy, S.; and Ghafoor, S. K. 2019. Infusing domain knowledge in AI-based "black box" models for better explainability with application in bankruptcy prediction. arXiv preprint arXiv:1905.11474.
Karpathy, A. 2015. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog 21.
Kho, S. J.; Padhee, S.; Bajaj, G.; Thirunarayan, K.; and Sheth, A. 2019. Domain-specific use cases for knowledge-enabled social media analysis. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining. Springer. 233-246.
Kimmig, A.; Bach, S.; Broecheler, M.; Huang, B.; and Getoor, L. 2012. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, 1-4.
Kursuncu, U.; Gaur, M.; Lokala, U.; Illendula, A.; Thirunarayan, K.; Daniulaityte, R.; Sheth, A.; and Arpinar, I. B. 2018. "What's ur type?" Contextualized classification of user types in marijuana-related communications using compositional multiview embedding. In IEEE/WIC/ACM International Conference on Web Intelligence (WI'18).
Kursuncu, U.; Gaur, M.; Castillo, C.; Alambo, A.; Thirunarayan, K.; Shalin, V.; Achilov, D.; Arpinar, I. B.; and Sheth, A. 2019a. Modeling Islamist extremist communications on social media using contextual dimensions: religion, ideology, and hate. Proceedings of the ACM on Human-Computer Interaction 3(CSCW):151.
Kursuncu, U.; Gaur, M.; Lokala, U.; Thirunarayan, K.; Sheth, A.; and Arpinar, I. B. 2019b. Predictive analysis on Twitter: techniques and applications. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining. Springer. 67-104.
Kursuncu, U.; Gaur, M.; Thirunarayan, K.; and Sheth, A. 2019c. Explainability of medical AI through domain knowledge. Ontology Summit 2019, Medical Explanation.
Kursuncu, U. 2018. Modeling the Persona in Persuasive Discourse on Social Media Using Context-aware and Knowledge-driven Learning. Ph.D. Dissertation, University of Georgia.
Lalithsena, S. 2018. Domain-specific knowledge extraction from the web of data. Ph.D. Dissertation, Wright State University.
Liu, H., and Singh, P. 2004. Commonsense reasoning in and over natural language. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 293-306. Springer.
Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Ju, Q.; Deng, H.; and Wang, P. 2019a. K-BERT: Enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606.
Liu, Z.; Niu, Z.-Y.; Wu, H.; and Wang, H. 2019b. Knowledge aware conversation generation with explainable reasoning over augmented graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1782-1792.
Liu, M.; Zhang, D.; and Chen, S. 2014. Attribute relation learning for zero-shot classification. Neurocomputing 139:34-46.
Longworth, C. 2010. Kernel methods for text-independent speaker verification. Ph.D. Dissertation, University of Cambridge.
Makni, B., and Hendler, J. Deep learning for noise-tolerant RDFS reasoning.
Masse, N. Y.; Grant, G. D.; and Freedman, D. J. 2018. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences 115(44):E10467-E10475.
Maurya, A. K. 2018. Learning low dimensional word based linear classifiers using data shared adaptive bootstrap aggregated lasso with application to IMDB data. arXiv preprint arXiv:1807.10623.
McInnes, B. T.; Pedersen, T.; and Pakhomov, S. V. 2009. UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. In AMIA Annual Symposium Proceedings, volume 2009, 431. American Medical Informatics Association.
Nickel, M.; Rosasco, L.; Poggio, T. A.; et al. 2016. Holographic embeddings of knowledge graphs. In AAAI, volume 2, 3-2.
Ohno-Machado, L.; Sansone, S.-A.; Alter, G.; Fore, I.; Grethe, J.; Xu, H.; Gonzalez-Beltran, A.; Rocca-Serra, P.; Gururaj, A. E.; Bell, E.; et al. 2017. Finding useful data across multiple biomedical data repositories using DataMed. Nature Genetics 49(6):816.
Olteanu, A.; Castillo, C.; Diaz, F.; and Kiciman, E. 2019. Social data: biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2:13.
Palatucci, M.; Pomerleau, D.; Hinton, G. E.; and Mitchell, T. M. 2009. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, 1410-1418.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
Perera, S.; Mendes, P. N.; Alex, A.; Sheth, A. P.; and Thirunarayan, K. 2016. Implicit entity linking in tweets. In International Semantic Web Conference, 118-132. Springer.
Roy, A.; Park, Y.; and Pan, S. 2017. Learning domain-specific word embeddings from sparse cybersecurity texts. arXiv preprint arXiv:1709.07470.
Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5):206-215.
Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
Sarker, M. K.; Xie, N.; Doran, D.; Raymer, M.; and Hitzler, P. 2017. Explaining trained neural networks with semantic web technologies: first steps. arXiv preprint arXiv:1710.04324.
Scarlini, B.; Pasini, T.; and Navigli, R. 2020. SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. In Proc. of AAAI.
Shen, Y.; Deng, Y.; Yang, M.; Li, Y.; Du, N.; Fan, W.; and Lei, K. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 901-904. ACM.
Sheth, A., and Kapanipathi, P. 2016. Semantic filtering for social data. IEEE Internet Computing 20(4):74-78.
Sheth, A., and Thirunarayan, K. 2012. Semantics empowered Web 3.0: managing enterprise, social, sensor, and cloud-based data and services for advanced applications. Synthesis Lectures on Data Management 4(6):1-175.
Sheth, A.; Perera, S.; Wijeratne, S.; and Thirunarayan, K. 2017. Knowledge will propel machine understanding of content: extrapolating from current examples. arXiv preprint arXiv:1707.05308.
Sheth, A.; Gaur, M.; Kursuncu, U.; and Wickramarachchi, R. 2019. Shades of knowledge-infused learning for enhancing deep learning. IEEE Internet Computing 23(6):54-63.
Shivakumar, P. G.; Li, H.; Knight, K.; and Georgiou, P. 2018. Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. arXiv preprint arXiv:1802.02607.
Sun, C.; Shrivastava, A.; Singh, S.; and Gupta, A. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, 843-852. IEEE.
Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; and Wu, H. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 1139-1147.
Tan, J.; Huo, Y.; Liang, Z.; and Li, L. 2019. Expert knowledge-infused deep learning for automatic lung nodule detection. Journal of X-ray Science and Technology 27(1):17-35.
Topol, E. J. 2019. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25(1):44-56.
Valiant, L. G. 2000. Robust logics. Artificial Intelligence 117(2):231-253.
Vo, K.; Pham, D.; Nguyen, M.; Mai, T.; and Quan, T. 2017. Combination of domain knowledge and deep learning for sentiment analysis. In International Workshop on Multi-disciplinary Trends in Artificial Intelligence, 162-173. Springer.
Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, 1112-1119.
Yi, K.; Jian, Z.; Chen, S.; Chen, Y.; and Zheng, N. 2018. Knowledge-based recurrent attentive neural network for traffic sign detection. arXiv preprint arXiv:1803.05263.
Young, T.; Cambria, E.; Chaturvedi, I.; Zhou, H.; Biswas, S.; and Huang, M. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.