Neuro-Symbolic Deductive Reasoning for Cross-Knowledge Graph Entailment Monireh Ebrahimia , Md Kamruzzaman Sarkera , Federico Bianchib , Ning Xiec , Aaron Eberharta , Derek Doranc , HyeongSik Kimd and Pascal Hitzlera a Department of Computer Science & Engineering, Kansas State University b Bocconi University c Department of Computer Science & Engineering, Wright State University d Bosch Research & Technology Center, North America Abstract A significant and recent development in neural-symbolic learning are deep neural networks that can reason over symbolic knowledge graphs (KGs). A particular task of interest is KG entailment, which is to infer the set of all facts that are a logical consequence of current and potential facts of a KG. Initial neural-symbolic systems that can deduce the entailment of a KG have been presented, but they are limited: current systems learn fact relations and entailment patterns specific to a particular KG and hence do not truly generalize, and must be retrained for each KG they are tasked with entailing. We propose a neural-symbolic system to address this limitation in this paper. It is designed as a differentiable end-to-end deep memory network that learns over abstract, generic symbols to discover entailment patterns common to any reasoning task. A key component of the system is a simple but highly effective normalization process for continuous representation learning of KG entities within memory networks. Our results show how the model, trained over a set of KGs, can effectively entail facts from KGs excluded from the training, even when the vocabulary or the domain of test KGs is completely different from the training KGs. Keywords deep learning, deductive reasoning, knowledge graph entailment, neuro-symbolic 1. Introduction For many years, reasoning has been tackled as the task of building systems capable of infer- ring new crisp symbolic logical rules. However, those traditional methods are too brittle to be applied to noisy automatically created KGs. [1] provides a taxonomy of noise types in web KGs with respect to its effects on reasoning and shows the detrimental impact of noises on the result of the traditional reasoners. With the recent revival of interest in artificial neural In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021) - Stanford University, Palo Alto, California, USA, March 22-24, 2021. " monireh@ksu.edu (M. Ebrahimi); mdkamruzzamansarker@ksu.edu (M.K. Sarker); f.bianchi@unibocconi.it (F. Bianchi); xie.25@wright.edu (N. Xie); aaroneberhart@ksu.edu (A. Eberhart); derek.doran@wright.edu (D. Doran); Hyeongsik.Kim@us.bosch.com (H. Kim); hitzler@ksu.edu (P. Hitzler)  © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) networks, the more robust neural link prediction models have been applied vastly for the com- pletion of KGs. These methods [2, 3, 4, 5, 6] heavily rely on the subsymbolic representation of entities and relations learned through maximization of a scoring objective function over valid factual triples. Thus, the success of such models hinges primarily on the power of those subsymbolic representations in encoding the similarity/relatedness of entities and relations. Recent attempts have focused on neural multi-hop reasoners [7, 8] to equip the model to deal with more complex reasoning where multi-hop inference is required. More recently, a Neural Theorem Prover [9] has been proposed in an attempt to take advantage of both symbolic and sub-symbolic reasoning. Despite their success, the main restriction common to machine learning-based reasoners is that they are unable to recognize and generalize to different domains or tasks. This inherent limitation follows from both the representations used and the learning process. The major issue comes from the mere reliance of these models on the representation of entities learned during the training or in the pre-training phase stored in a lookup table. Consequently, these mod- els usually have difficulty to deal with out-of-vocabulary (OOV) entities. Although the OOV problem has been addressed in part in the natural language processing (NLP) domain by tak- ing advantage of character-level embedding [10], subword units [11], Byte-Pair-Encoding [12], learning embeddings on the fly by leveraging text descriptions or spelling [13], copy mecha- nism [14] or pointer networks [15], still these solutions are insufficient for transferring pur- poses for reasoning. [16] shows that the success of natural language inference (NLI) methods is heavily benchmark specific. An even greater source of concern is that reasoning in most of the above sub-symbolic approaches hinges more on the notion of similarity and geometric-based proximity of real-valued vectors (induction) as opposed to performing transitive reasoning (de- duction) over them. Nevertheless, recent years have seen some progress in zero-shot relation learning in sub-symbolic reasoning domain[17]. Zero-shot learning refers to the ability of the model to infer new relations where that relation has not been seen before in training set[18]. This generalization capability is still quite limited and fundamentally different from our work in terms of both methodology and purpose. Inspired by these observations, we take a different approach in this work by investigating the emulation of deductive symbolic reasoning using memory networks. Memory networks [19] are a class of learning models capable of conducting multiple computational steps over an explicit memory component before returning an answer. Their sequential nature corresponds, conceptually, to the sequential process underlying some deductive reasoning algorithms. The attention modeling corresponds to pulling only relevant information (logical axioms) necessary for the next reasoning step. Besides, as attention can be traced over the run of a memory network, we will furthermore get insights into the "reasoning" underlying the network output. This paper contributes a recipe involving a simple but effective KG triple normalization be- fore learning their representation within an end-to-end memory network. To perform logical inference in more abstract level, and thereby facilitating the transfer of reasoning expertise from one KG to another, the normalization maps entities and predicates in a KG to a generic vocabulary. Facts in additional KGs are normalized using the same vocabulary, so that the net- work does not learn to overfit its learning to entity and predicate names in a specific KG. This emulates symbolic reasoning by neural embeddings as the actual names (as strings) of entities from the underlying logic such as variables, constants, functions, and predicates are insub- stantial for logical entailment in the sense that a consistent renaming across a theory does not change the set of entailed formulas (under the same renaming). Thanks to the term-agnostic feature of our representation, we are able to create a reasoning system capable of performing reasoning over an unseen set of vocabularies in the test phase. Our contributions are threefold: (i) We present the construction of memory networks for emulating the symbolic deductive reasoning. (ii) We propose an optimization to this architec- ture using normalization approach to enhance their transfer capability. We show that in an unnormalized setting, they fail to perform well across KGs. (iii) We examine the efficacy of our model for cross-domain and cross-KG deductive reasoning. We also show the scalability of our model (in terms of reduced time and space complexity) for large datasets. Related Work On the issue of doing logical reasoning using deep networks, we mention the following selected recent contributions: Tensor-based approaches have been used [20, 9, 21], following [3]. However, approaches are restricted in terms of logical expressivity and/or to toy examples [22, 20]. [1] perform Resource Description Framework (RDF) reasoning based on KG embeddings. [23] considers OWL RL reasoning [24]. There is a fundamental difference between these contributions and our approach, though: We train our model once, and then the model transfers to all other RDF KGs with good performance. In the above mentioned publications, training is either done on (a part of the) KG which is also used for evaluation, or training is explicitly done on similar KGs, in terms of topic. More precisely, in case of [23], it requires re-training for obtaining embeddings for new vocabularies. 2. Problem Formulation To explain what we are setting out to do, let us first re-frame the deductive reasoning problem as a classification task. Any given logic  comes with an entailment relation ⊧ ⊆ 𝑇 × 𝐹 , where 𝐹 is a subset of the set of all logical formulas (or axioms) over , and 𝑇 is the set of all theories (or sets of logical formulas) over . If 𝑇 ⊧ 𝐹 , then we say that 𝐹 is entailed by 𝑇 . Re-framed as a classification task, we can ask whether a given pair (𝑇 , 𝐹 ) ∈ 𝑇 × 𝐹 should be classified as a valid entailment (i.e., 𝑇 ⊧ 𝐹 ) holds, or as the opposite (i.e., 𝑇 ̸⊧ 𝐹 ). We would like to train a model on sets of examples (𝑇 , 𝐹 ), such that it learns to correctly classify them as valid or invalid inferences. We wish to train a neural model that will learn to reason over one set of theories, and can then transfer that learning to new theories over the same logic. This way, our results will demonstrate that the reasoning principles (entailment under the model-theoretic semantics) that underlie the logic have been learned. If we were to train a model such that it learns only to reason over one theory, or a few very similar theories, then that could hardly be demon- strated. One of the key obstacles we face with our task is to understand how to represent training and test data so that they can be used in deep learning settings. To use standard deep learning approaches, formulas – or even theories – will have to be represented over the real coordinate space 𝑅 as vectors, matrices or tensors. Many embeddings for RDF (i.e., KGs) have been proposed [25, 6, 26], but we are not aware of an existing embedding that captures what seems important for the deductive reasoning scenario. Indeed, the prominent use case explored for KG embeddings is not deductive in nature; rather, it concerns the problem of the discovery or suggestion of additional links or edges in the graph, together with appropriate edge labels. In this link discovery setting, the actual labels for nodes or edges in the graph, and as such their commonsense meanings, are likely important, and most existing embeddings reflect this. However, for deductive reasoning the names of entities are insubstantial and should not be cap- tured by an embedding. Another inherent problem in the use of such representations across KGs is the OOV problem. While a word lookup table can be initialized with vectors in an unsu- pervised task or during training of the reasoner, it still cannot generate vector representations for unseen terms. It is further impractical to store the vectors of all words when vocabulary size is huge [10]. Similarly, memory networks usually rely on word-level embedding lookup tables, i.e., learned with the underlying rationale that words that occur in similar supervised scenarios should be represented by similar vectors in the real coordinate space. That is why they are known to have difficulties dealing with OOV, as a word lookup table cannot provide a representation for the unseen, and thus cannot be applied to NLI over new words [13], and for us this would pose a challenge in the transfer to new KGs. We thus need embeddings that are agnostic to the terms (i.e., strings) used as primitives in the KG. To build such an embedding, we use syntactic normalization: a renaming of primitives from the logical language (variables, constants, functions, predicates) to a set of predefined entity names that are used across different normalized theories. By randomly assigning the mapping for the renaming, the network’s learning will be based on the structural information within the theories, and not on the actual names of the primitives. Note that this normalization does not only play the role of “forgetting” irrelevant label names, but also makes it possible to transfer learning from one KG to the other. Indeed, for the approach to work, the network should be trained with many KGs, and then subsequently tested on completely new ones which had not been encountered during training. Our results show that our simple but very effective normalization yields a word-agnostic system capable of deductive reasoning over previously- unseen RDF KGs containing new vocabulary. 3. Model Architecture We consider a model architecture that adapts the end-to-end memory network proposed by [19] with fundamental alterations necessary for abstract reasoning. A high-level view of our model is shown in Figure 1. It takes a discrete set 𝐺 of normalized RDF statements (called triples) 𝑡1 , ..., 𝑡𝑛 that are stored in memory, a query 𝑞, and outputs a “yes” or “no” answer to determine if 𝑞 is entailed by 𝐺. Each of the normalized 𝑡𝑖 and 𝑞 contains symbols coming from a general dictionary with 𝑉 normalized words shared among all of the normalized RDF theories in both training and test sets. The model writes all triples to the memory and then calculates a continuous embedding for 𝐺 and 𝑞. Through multiple hop attention over those continuous representations, the model then classifies the query. The model is trained by back-propagation of error from output to the input through multiple memory accesses. We discuss components of the architecture in more detail below. Figure 1: Diagram of the proposed model, for K=1 Model Description The model is augmented with an external memory component that stores the embeddings of the normalized triples in our KG. This memory is defined as an 𝑛 × 𝑑 tensor where 𝑛 denotes the number of triples in the KG and 𝑑 is the dimensionality of the embeddings. The KG is stored in the memory vectors from two continuous representations of 𝑚𝑖 and 𝑐𝑖 obtained from two input and output embedding matrices of A and C with size 𝑑 × 𝑉 where 𝑉 is the size of vocabulary. Similarly, the query 𝑞 is embedded via a matrix 𝐵 to obtain an internal state 𝑢. In each reasoning step, those memory slots useful for finding the correct answers should have their contents retrieved. To enable this, we use an attention mechanism for 𝑞 over memory input representations by taking an internal product followed by a softmax: 𝑝𝑖 = Softmax(𝑢 𝑇 (𝑚𝑖 )) (1) 𝑒 (𝑎𝑖 ) where Softmax(𝑎𝑖 ) = . ∑𝑗 𝑒 (𝑎𝑗 ) Equation (1) calculates a probability vector 𝑝 over the memory inputs, the output vector 𝑜 is then computed as the weighted sum of the transformed memory contents 𝑐𝑖 with respect to their corresponding probabilities 𝑝𝑖 by 𝑜 = ∑𝑖 𝑝𝑖 𝑐𝑖 . This describes the computation within a single hop. The internal state of the query vector updates for the next hop as 𝑢 𝑘+1 = 𝑢 𝑘 + 𝑜 𝑘 . The process repeats 𝐾 times where 𝐾 is the number of computational hops. The output of the 𝐾 𝑡ℎ hop is used to predict the label 𝑎̂ by passing 𝑜 𝐾 and 𝑢 𝐾 through a weight matrix of size 𝑉 × 𝑑 and a softmax: 𝑎̂ = Softmax(𝑊 (𝑢 𝐾 +1 )) = Softmax(𝑊 (𝑢 𝑘 + 𝑜 𝑘 )). Figure 1 shows the model for 𝐾 = 1 (1 hop). The learning parameters are the matrices 𝐴, 𝐵, 𝐶, and 𝑊 . Memory Content There is a plethora of logics which could be used for our investigation. Here we use RDF. The RDF [27] is an established and widely used W3C standard for expressing KGs. An RDF KG is a collection of statements stored as triples (𝑒1, 𝑟, 𝑒2) where 𝑒1 and 𝑒2 are called subject and object, respectively, while 𝑟 is a relation binding 𝑒1 and 𝑒2 together. State- ments can constitute base facts (logically speaking, in this case 𝑒1 and 𝑒2 would be constants, and 𝑟 a binary predicate) or simple logical axioms (e.g., 𝑒1 and 𝑒2 could identify unary prediates or classes, and 𝑟 would be class subsumption or material implication). Every entity in an RDF KG is represented by a unique Universal Resource Identifier (URI). We normalize these triples by systematically renaming all URIs which are not in the RDF/RDFS (Schema) namespaces as discussed previously. Each such URI is mapped to a set of arbitrary strings in a predefined set  = {𝑎1 , ..., 𝑎𝑛 }, where 𝑛 is taken as a training hyper-parameter giving an upper bound for the largest number of entities in a KG the system will be able to handle. Note that URIs in the RDF/RDFS namespaces are not renamed, as they are important for the deductive reasoning according to the RDF model-theoretic semantics. Consequently, each normalized RDF KG will be a collection of facts stored as triples {(𝑎𝑖 , 𝑎𝑗 , 𝑎𝑘 )}. It is important to note that each symbol is mapped into an element of  regardless of its position in the triple, and whether it is a subject or an object or a predicate. Yet the position of an element within a triple is an important feature to consider. Thus we employ a positional encoding(PE) [19] to encode the position of each element within the triple. Let 𝑗th element of 𝑖th triple be 𝑡𝑖,𝑗 . This gives us memory vector representation of each triple as 𝑚𝑖 = ∑𝑗 𝑙𝑗 ◦𝑡𝑖,𝑗 , where ◦ is the Hadamard (element-wise) product and 𝑙𝑗 is a column vector with the structure 𝑙𝑘,𝑗 = (1 − 𝑗/3) − (𝑘(1 − 2𝑗/3)/𝑑) (assuming 1-based indexing), where 𝑑 is the size of the embedding vector in the memory embedding matrix and the 3 in the denominator corresponds to the num- ber of elements in an RDF triplet. Each memory slot thus represents the positional-weighted summation of each triplet. The positional encoding ensures that the order of the elements now affects the encoding of each memory slot. 4. Evaluation The RDF semantics standard specification [28] describes a prodecural semantics based on 13 completion rules, which can be used to algorithmically compute logical consequences. The completion of an RDF KG is in general infinite because, by definition, there is an infinite set of facts (related to RDF-encodings of lists) which are always entailed – however for practi- cal reasons, and as recommended in the standard specification, only certain finite subsets are computed as completions of RDF KGs, and we do the same. Dataset There are many RDF KGs available on the World Wide Web that can be used to create our own dataset. For this purpose, we have collected RDF datasets from the Linked Data Cloud1 and the Data Hub2 to create our datasets.3 Our training set (which by coincidence was based on RDF data conforming also to the OWL standard [24] and which we call an “OWL- centric” dataset) is comprised of a set of RDF KGs each of size 1,000 triples, sampled from populating around 20 OWL ontologies with different data. In order to test our model’s ability 1 https://lod-cloud.net/ 2 https://datahub.io/ 3 https://github.com/Monireh2/kg-deductive-reasoner to generalize to completely different datasets, we have collected another dataset which we call the OWL-Centric Test Set. Furthermore, to assure our evaluation represents real-world RDF data completely independent of the training data, we have used almost all RDF KGs listed in a recent RDF quality survey [29]; we call this the Linked Data test set. Further, to test the limitations of our model on artificially difficult data, we have created a small synthetic dataset which requires long reasoning chains if done with a symbolic reasoner. For each KG we have created the finite set of inferred triples using the Apache Jena4 API. These inferred triples comprise our positive class instances. For generating invalid instances we used the following two methods. In the first, we generated non-inferred triples by random permutation of triple entities and removing those triples which were entailed. In the second scenario, which serves as our final quality check for not including trivially invalid triples in our dataset, we created invalid instances using the rdf:type predicate. More specifically, for each valid triple in the dataset, we replaced one of the elements (chosen randomly), with another random element which qualifies for being placed in that triple based on its rdf:type relation- ships. The datasets created by this strategy are marked with superscript “a” in Table 1. Training Details Trainings were done over 10 epochs using the Adam optimizer with a learning rate of 𝜂 = 0.005, a learning rate decay of 𝜂/2, and a batch size of 100 over triples. For the final batches of queries for each KG, we have used zero-padding to the maximum batch size of 100. The capacity of the external memory is 1,000 which is also the maximum size of our KGs. We used a linear starting of 1 epoch where we have removed the softmax from each memory layer except for the final layer. L2 norm clipping of max 40 was applied to the gradient. The memory input/output embeddings are vectors of size 20. The embedding matrices of A, B, and C therefore are of size |𝑉 | × 𝑑 = 3, 033 × 20, where 3,033 is the size of the normalized generic vocabulary plus RDF(S) namespace vocabulary. Unless otherwise mentioned, we have used 𝐾 = 10. Adjacent weight sharing was used where the output embedding of one layer is the input embedding of the next one, as in 𝐴𝑘+1 = 𝐶. Similarly, the answer prediction weight matrix 𝑊 gets copied to the final output embedding 𝐶 𝐾 and query embedding is equal to the first layer input embedding as in 𝐵 = 𝐴1 . All the weights are initialized by a Gaussian distribution with 𝜇 = 0 and 𝜎 = 0.1. We would like to emphasize again that one and the same trained model was used in the evaluation over different test sets. We did not retrain, e.g., on Linked Data for the Linked Data test set. Quantitative Results We now present and discuss our evaluation results. Our evaluation metrics are average of precision and recall and f-measure over all the KGs in the test set, ob- tained for both valid and invalid sets of triples. We also report the recall for the class of neg- atives (specificity) to interpret the result more carefully by counting the number of true nega- tives. Additionally, as mentioned earlier, we have done zero-padding for each batch of queries with size less than 100. This implies the need for introducing another class label for such zero paddings both in the training and test phase. We have not considered the zero-padding class in the calculation of precision, recall and f-measure. Through our evaluations, however, we have observed some missclassifications from/to this class. Thus, we report accuracy as well. 4 https://jena.apache.org/ To the best of our knowledge there is no architecture capable of conducting deductive rea- soning on completely unseen RDF KGs. In addition, NTP and LTNs appear to have severe scalability issues, which means we cannot compare them to our system at scale. Neighbour- hood approximated Neural Theorem Provers [30] heavily rely on entity embeddings, making it unsuitable for our goal, as discussed. That is why we have considered the non-normalized embedding version of our memory network as a baseline. Similarly, Graph-to-Graph learning architecture [1] is ontology-specific model. In fact, after training such model on one domain, you need to adapt the model hyper-parameters for another one and start the training from scratch on a different width model. Beside that, the Graph-to-Graph model is not scalable to large ontologies like DBpedia; instead it restricts the vocabulary to small restricted domain datasets. These inherent limitations for cross-ontology adaptation and the generative nature of the model (as opposed to classification in our setup) makes the comparison impossible. Our technique shows a significant advantage over the baseline as shown in Table 1. A further even more important benefit of using our normalization model is its training time. In fact, this considerable time complexity difference is the result of the remarkable size difference of embedding matrices in the original and normalized cases. For instance, the size of embedding matrices to be learned by our algorithm for the normalized OWL-Centric dataset is 3, 033×20 as opposed to 811, 261 × 20 for the non-normalized one (and 1, 974, 062 × 20 for Linked Data which is prohibitively big). That has caused a remarkably high decrease in training time and space complexity and consequently has helped the scalability of our memory networks. In case of the OWL-Centric dataset, for instance, the space required for saving the normalized model is 80 times less than the intact model (≈ 4𝐺𝐵 after compression). Nevertheless, the normalized model is almost 40 times faster to train than the non-normalized one for this dataset. Our normalized model trained for just a day on OWL-Centric data but achieves better accuracy, whereas it trained on the same non-normalized dataset more than a week on a 12-core machine. Hence, the importance of using normalization cannot be emphasized enough. To further get an idea of how our model performs on different data sources, we have applied our approach on multiple datasets with various characteristics. The result across all variations are given in Table 1. From this Table we can see that, apart from our strikingly good perfor- mance compared to the baseline, there are number of other interesting points: Our model gets even better results on the Linked Data task while it has trained on the OWL-Centric dataset. We hypothesize that this may be due to a generally simpler structure of Linked Data, but vali- dating this will need further research. The large portion of our few false negative instances come from the inability of our model to infer that all classes are subclass of themselves. Another interesting observation is the poor performance of our algorithm when it has trained on the OWL-Centric dataset and tested on a tricky version of the Linked Data. In that case our model has classified most of the triples to the “yes” class and this has led to low specificity (recall for “no” class) of 16%. This seems inevitable because in this case the negative instances bear close resemblance to the positives ones, making differentiation more challenging. However, training the model on the tricky OWL-Centric dataset has improved that by a substantial margin (more than three times). In case of our particularly challenging synthetic data, performance is not as good, and this may be due to the very different nature of this dataset which would require much longer reasoning chains than the non-synthetic data. Our training so far has only been done on real-world datasets; it may Valid Triples Class Invalid Triples Class Training Dataset Test Dataset Accuracy Recall Recall Precision F-measure Precision F-measure /Sensitivity /Specificity OWL-Centric Dataset Linked Data 93 98 96 98 93 95 96 OWL-Centric Dataset (90%) OWL-Centric Dataset (10%) 88 91 89 90 88 89 90 OWL-Centric Dataset OWL-Centric Test Set b 79 62 68 70 84 76 69 OWL-Centric Dataset Synthetic Data 65 49 40 52 54 42 52 OWL-Centric Dataset Linked Data a 54 98 70 91 16 27 86 OWL-Centric Dataset a Linked Data a 62 72 67 67 56 61 91 OWL-Centric Dataset(90%) a OWL-Centric Dataset(10%) a 79 72 75 74 81 77 80 OWL-Centric Dataset OWL-Centric Test Set ab 58 68 62 62 50 54 58 OWL-Centric Dataset a OWL-Centric Test Set ab 77 57 65 66 82 73 73 OWL-Centric Dataset Synthetic Data a 70 51 40 47 52 38 51 OWL-Centric Dataset a Synthetic Data a 67 23 25 52 80 62 50 Baseline OWL-Centric Dataset Linked Data 73 98 83 94 46 61 43 OWL-Centric Dataset (90%) OWL-Centric Dataset (10%) 84 83 84 84 84 84 82 OWL-Centric Dataset OWL-Centric Test Set b 62 84 70 80 40 48 61 OWL-Centric Dataset Synthetic Data 35 41 32 48 55 45 48 a More Tricky Nos & Balanced Dataset b Completely Different Domain. Table 1 Experimental results of proposed model be interesting to more closely investigate our approach when trained on synthetic data, but that was not the purpose of our study. It appears natural to analyze the reasoning depth acquired by our network. We conjecture that reasoning depth acquired by the network will correspond both to (1) the number of layers in the deep network, and (2) the ratio of deep versus shallow reasoning required to perform the deductive reasoning. Forward-chaining reasoners iteratively apply inference rules in order to derive new entailed facts. In subsequent iterations, the previously derived facts need to be taken into account. To gain a first understanding of what our model has learned in this respect, we have mimicked this symbolic reasoner behavior in creating our test set. We first started from our input KG 𝐾0 in hop 0. We then produced, subsequently, KGs of 𝐾1 ,..., 𝐾𝑛 until no new triples are added (i.e. 𝐾𝑛+1 is empty) by applying the RDF inference rules from the specification: The hop 0 dataset contains the original KG’s triples in the inferred axioms, hop 1 contains the RDF(S) axiomatic triples. The real inference steps start with 𝐾𝑛 where 𝑛 >= 2. Table 2 summarizes our results in this setup. Unsurprisingly, we observe that result over our synthetic data is poor. This may be because of the huge gap between the distribution of our training data over reasoning hops and the synthetic data reasoning hop length distribution as shown in the first row of Table 2. From that, one can see how the distribution of our training set affects the learning capability of our model. Apart from our observations, previous studies [31, 9, 32] also corroborate that the reasoning chain length in real-world KGs is limited to 3 or 4. Hence, a synthetic training toy set would have to be built as part of follow-up work. General Embeddings Visualization In order to gain some insight on the nature of our nor- malized embeddings, we have plotted a Principal Component Analysis (PCA) two-dimensional vector visualization of embeddings computed for the RDF(S) terms and all normalized words in the KGs, in Figure 2. The embeddings were fetched from the matrix B (embedding query lookup table) in the hop 1 of our model trained over the OWL-Centric dataset. Words are positioned in the plot based on the similarity of their embedding vectors. As anticipated, all the normalized Hop 1 Hop 2 Hop 3 Hop 4 Hop 5 Hop 6 Hop 7 Hop 8 Hop 9 Hop 10 Dataset F% D% F% D% F% D% F% D% F% D% F% D% F% D% F% D% F% D% F% D% OWL-Centrica - 8 - 67 - 24 - 1 - - - - - - - - - - - - OWL-Centricb 42 5 78 64 44 30 6 1 - - - - - - - - - - - - Linked Datac 88 31 93 50 86 19 - - - - - - - - - - - - - - Linked Data d 86 34 93 46 88 20 - - - - - - - - - - - - - - Synthetic 38 0.03 44 1.42 32 1 33 1.56 33 3.09 33 6.03 33 11.46 31 20.48 31 31.25 28 23.65% a Training set b Completely different domain c LemonUby Ontology d Agrovoc Ontology Table 2 F-measure and Data Distribution over each reasoning hop Figure 2: PCA projection of embeddings for the vocabulary words tend to form one cluster as opposed to multiple ones. The PCA projection illustrates the ability of our model to automatically organize RDF(S) concepts and learn implicitly the rela- tionships between them. For instance, rdfs:domain and rdfs:range have been located very close together and far from normalized entities. rdf:subject, rdf:predicate and rdf:object vectors are very similar, and the same for rdf:seeAlso and rdf:isDefinedBy. Likewise, rdfs:container, rdf:bag, rdf:seq, and rdf:alt are in the vicinity of each other. rdf:langstring is the only RDF(S) entity which is inside the normalized entities cluster. We believe that it is because rdf:langString’s domain and range is string and consequently it has mainly co-occurred with normalized instances in the KGs. Another possible reason for this is its low frequency in our data. 5. Conclusions and Future Work We have demonstrated that a deep learning architecture based on memory networks and pre- embedding normalization is capable of learning how to perform deductive reason over previ- ously unseen RDF KGs with high accuracy. We believe that we have thus provided the first deep learning approach that is capable of high accuracy RDF deductive reasoning over previ- ously unseen KGs. Normalization appears to be a critical component for high performance of our system. We plan to investigate its scalability and to adapt it to other, more complex, logics. 6. Acknowledgements This work was supported by the Air Force Office of Scientific Research under award number FA9550-18-1-0386 and by the National Science Foundation (NSF) under award OIA-2033521 "KnowWhereGraph: Enriching and Linking Cross-Domain Knowledge Graphs using Spatially- Explicit AI Technologies." References [1] B. Makni, J. Hendler, Deep Learning for Noise-tolerant RDFS Reasoning, Ph.D. thesis, Rensselaer Polytechnic Institute, 2018. [2] S. Riedel, L. Yao, A. McCallum, B. M. Marlin, Relation extraction with matrix factorization and universal schemas, in: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, 2013, pp. 74–84. [3] R. Socher, D. Chen, C. D. Manning, A. Ng, Reasoning with neural tensor networks for knowledge base completion, in: Advances in neural information processing systems, 2013, pp. 926–934. [4] B. Yang, W.-t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, arXiv preprint arXiv:1412.6575 (2014). [5] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, M. Gamon, Representing text for joint embedding of text and knowledge bases, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1499–1509. [6] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: International Conference on Machine Learning, 2016, pp. 2071–2080. [7] B. Peng, Z. Lu, H. Li, K.-F. Wong, Towards neural network-based reasoning, arXiv preprint arXiv:1508.05508 (2015). [8] R. Das, A. Neelakantan, D. Belanger, A. McCallum, Chains of reasoning over entities, re- lations, and text using recurrent neural networks, arXiv preprint arXiv:1607.01426 (2016). [9] T. Rocktäschel, S. Riedel, End-to-end differentiable proving, in: Advances in Neural Information Processing Systems, 2017, pp. 3788–3800. [10] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, I. Trancoso, Finding function in form: Compositional character models for open vocabulary word representation, arXiv preprint arXiv:1508.02096 (2015). [11] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, arXiv preprint arXiv:1804.10959 (2018). [12] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015). [13] D. Bahdanau, T. Bosc, S. Jastrzębski, E. Grefenstette, P. Vincent, Y. Bengio, Learning to compute word embeddings on the fly, arXiv preprint arXiv:1706.00286 (2017). [14] M. Eric, C. D. Manning, A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue, arXiv preprint arXiv:1701.04024 (2017). [15] D. Raghu, N. Gupta, et al., Hierarchical pointer memory network for task oriented dia- logue, arXiv preprint arXiv:1805.01216 (2018). [16] A. Talman, S. Chatzikyriakidis, Testing the generalization power of neural network mod- els across nli benchmarks, arXiv preprint arXiv:1810.09774 (2018). [17] W. Xiong, T. Hoang, W. Y. Wang, Deeppath: A reinforcement learning method for knowl- edge graph reasoning, arXiv preprint arXiv:1707.06690 (2017). [18] A. Bordes, J. Weston, R. Collobert, Y. Bengio, Learning structured embeddings of knowl- edge bases, in: AAAI, AAAI Press, 2011. [19] S. Sukhbaatar, J. Weston, R. Fergus, et al., End-to-end memory networks, in: Advances in neural information processing systems, 2015, pp. 2440–2448. [20] R. Evans, D. Saxton, D. Amos, P. Kohli, E. Grefenstette, Can neural networks understand logical entailment?, arXiv preprint arXiv:1802.08535 (2018). [21] L. Serafini, A. S. d. Garcez, Learning and reasoning with logic tensor networks, in: Con- ference of the Italian Association for Artificial Intelligence, Springer, 2016, pp. 334–348. [22] F. Bianchi, P. Hitzler, On the capabilities of logic tensor networks for deductive reasoning., in: AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineer- ing, 2019. [23] P. Hohenecker, T. Lukasiewicz, Ontology reasoning with deep neural networks, arXiv preprint arXiv:1808.07980 (2018). [24] P. Hitzler, M. Krötzsch, B. Parsia, P. F. Patel-Schneider, S. Rudolph (Eds.), OWL 2 Web Ontology Language: Primer (Second Edition), W3C Recommendation 11 December 2012, 2012. Available from http://www.w3.org/TR/owl2-primer/. [25] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embed- dings for modeling multi-relational data, in: Advances in neural information processing systems, 2013, pp. 2787–2795. [26] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes., in: AAAI, volume 14, 2014, pp. 1112–1119. [27] P. Hitzler, M. Krotzsch, S. Rudolph, Foundations of semantic web technologies, Chapman and Hall/CRC, 2009. [28] P. J. Hayes, P. F. Patel-Schneider (Eds.), RDF 1.1 Semantics, 2014. Available from http://www.w3.org/TR/rdf11-mt/. [29] S. Sam, P. Hitzler, K. Janowicz, On the quality of vocabularies for linked dataset papers published in the semantic web journal, Semantic Web 9 (2018) 207–220. [30] P. Minervini, M. Bosnjak, T. Rocktäschel, E. Grefenstette, S. Riedel, Scalable neural theo- rem proving on knowledge bases and natural language (2018). [31] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, A. McCallum, Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning, arXiv preprint arXiv:1711.05851 (2017). [32] F. Yang, Z. Yang, W. W. Cohen, Differentiable learning of logical rules for knowledge base reasoning, in: Advances in Neural Information Processing Systems, 2017, pp. 2319–2328.