<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging a domain ontology in (neural) learning from heterogeneous data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomas Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petko Valtchev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdoulaye Baniré Diallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Recherche en Intelligence Artificielle (CRIA), UQAM</institution>
          ,
          <addr-line>Montreal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Injecting domain knowledge into a neural learner to alleviate reliance on high-quality data and improve explainability is a rapidly expanding research trend. While most of the effort has focused on regular-topology formats such as sequences and grids, we consider graph datasets. Moreover, instead of knowledge graph (KG) embedding, which underlies the majority of graph-centered methods, we propose a dedicated pattern mining-based approach. As our patterns are ontologically generalized, they achieve multiple objectives: domain knowledge infusion, generalization capacity enhancement, interpretability, etc.</p>
      </abstract>
      <kwd-group>
        <kwd>Domain Ontology</kwd>
        <kwd>Symbolic methods</kwd>
        <kwd>Sub-symbolic methods</kwd>
        <kwd>Neural networks</kwd>
        <kwd>Graph pattern mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Motivation</title>
      <p>Nowadays, implementing decision support systems to Daily activities in agro-industrial sector, e.g. a
maintehelp practitioners in complex activities has become a nance of a dairy farm, like those in other areas related
current practice in many fields. Many of these sys- to life sciences, generate large amounts of data. The
tems, traditionally, have used machine learning to pre- underlying data sources reflect complementary aspects
dict the outcome of a specific problem in the user’s en- such as farm yield, environment, animal health,
gevironment and use the prediction to suggest concrete netics, etc. The recent trend of precision(-based)
agriactions. Deep learning has arrived with a promise to culture looks at exploiting this data to support the
deexpand the areas where automation is successfully ap- cision making of domain stake-holders [3]: farmers,
plied in problem solving, hence the expectation for agronomists, dairy companies, insurers, etc.
high-quality decision support to profuse. Yet, in order to be efective, any recommendation</p>
      <p>However, predicting or learning representations on will have to reflect existing practices and, more
genersuch complex domains typically requires the availabil- ally, at least partly reflect the general knowledge from
ity of large amounts of data of suficiently high qual- the domain. For instance, at the end of each lactation a
ity. Unfortunately, in practice, such datasets are not cow gets dry for a while. Yet there is no a
straightforalways readily available. Conversely, often quantities ward way to train a neural model on milk yield data:
of machine-readable expert knowledge do exist, and The ensuing abrupt drop in milk yield is hard to digest
could potentially complement already available data. for, at least, the most popular deep learning
architecSince they reflect at least partly the expertise that un- tures [4]. Indeed, these models do not seem to properly
derlies decision making in the field, it is only natu- grasp the dynamics in a cow life-cycle, e.g. lactation,
ral to look for ways to inject that knowledge into the calvings, drying, etc.
learning process to try to guide it and compensate the While there are still work-arounds left to explore,
scarceness of high-quality data. one legitimate research question is whether injecting</p>
      <p>For several decades, ontologies, i.e. structured rep- some domain knowledge would help here. In a broader
resentation of domain concepts and their relations [1], approach, we investigate the impact of feeding
comhave been promoted as the appropriate tool for mak- plementary data, e.g. on genetics and animal health,
ing domain knowledge available for machine process- and organizing the overall dataset under a domain
oning [2]. tology (DO) providing additional descriptive
knowledge.</p>
      <p>While supplementing a neural learner with domain
knowledge stemming from an ontology is definitely
appealing, it is also a challenging task, mainly due to
the “impedance mismatch”, i.e. the divergence in the
respective levels of knowledge expression and
manipulation [5].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Current State Of The Art</title>
      <p>While symbolic representations, as a way to capture knowledge, have clearly dominated the AI field since its inception, recently sub-symbolic ones – in the form of trained neural networks – have rapidly gained in popularity and use [6]. By trading the discrete and man-made (i.e. modelling) entities of the former for machine-made (artificial) and loosely defined “patterns”, the latter break free of prior knowledge in order to, arguably, benefit from a more powerful yet difficult-to-interpret representation language. At its core, information is distilled throughout a network as a set of waves (or pulses) representing captured knowledge.</p>
      <p>In a broader scope, injecting domain knowledge into a machine learning process has been extensively researched and proven helpful in many practical situations [7]. More recently, since deep learning has moved centre stage, the focus has shifted to making neural networks collaborate with symbolic knowledge sources, mostly knowledge graphs (KG) and, somewhat more modestly, domain ontologies. In [5], the authors propose a classification of methods for feeding domain knowledge to artificial neural networks (ANNs), in particular, to deep ones. Their own proposal, called knowledge-infused learning (K-IL), addresses a variety of issues with ANNs, in particular, reliance on large datasets of sufficient quality, biases in training data selection, complexity, etc. The proposed answer represents a spectrum of fine-grained transformations of the ANN architecture reflecting the content of a KG, ranging from correcting the loss function to modifying the propagation through the network via connection weights.</p>
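      <p>To make the loss-correction end of that spectrum concrete, the sketch below (a minimal PyTorch illustration of the general idea, not the formulation of [5]; the penalty term and its weight are hypothetical) augments a supervised loss with a term that punishes confident predictions contradicting facts drawn from a KG.</p>
      <preformat>
import torch
import torch.nn.functional as F

def knowledge_infused_loss(logits, targets, kg_violation, lambda_kg=0.1):
    """Hypothetical knowledge-infused loss: standard cross-entropy plus a
    penalty that grows when confident predictions contradict KG facts.
    kg_violation: tensor in [0, 1], one score per example, precomputed
    from the knowledge graph (1.0 = fully inconsistent with the KG)."""
    supervised = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    confidence = probs.max(dim=1).values
    kg_penalty = (confidence * kg_violation).mean()
    return supervised + lambda_kg * kg_penalty
</preformat>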
      <p>The broader trend of using KGs in the form of embeddings – of vertices, edges or both – into a vector space, e.g. in order to support various natural language processing tasks, has been highly prolific for almost a decade now (see [8] for a somewhat outdated survey). While the initial work by Bordes et al. [9, 10] looked at embedding a triple from a KG using energy-based methods to force plausible combinations of component embeddings, in [11] the focus is exclusively on vertices, i.e. domain entities. The proposed RDF2vec method generates a set of entity sequences, through random walks and iterative neighbourhood encoding techniques, which are then fed to word embedding methods. In a medical context, the authors of [12] present a somewhat different approach toward leveraging a KG in neural learning: in order to assess patient risk from a series of health events, they translate the neighbourhoods of an event-centred KG into attention filters for an LSTM-based ANN.</p>
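      <p>To illustrate the energy-based view of [9], the following sketch scores a (head, relation, tail) triple TransE-style, assigning low energy to plausible combinations of component embeddings; the toy entities and relations below are ours, for illustration only.</p>
      <preformat>
import numpy as np

def transe_energy(entity_emb, relation_emb, head, relation, tail):
    """TransE-style energy of a (head, relation, tail) triple: ||h + r - t||.
    Plausible triples are trained to receive a low energy."""
    return np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

# Toy vocabulary with random embeddings, for illustration only.
rng = np.random.default_rng(0)
entity_emb = {e: rng.normal(size=50) for e in ("cow_42", "mastitis", "amoxicillin")}
relation_emb = {r: rng.normal(size=50) for r in ("diagnosedWith", "treatedWith")}
print(transe_energy(entity_emb, relation_emb, "cow_42", "treatedWith", "amoxicillin"))
</preformat>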
      <p>Overall, KG embedding is not straightforwardly portable to DOs: in the case of KGs, amalgamation is favoured by them being on the same abstraction level as the training data. In contrast, classes and properties from a DO represent abstractions, i.e. sets of data objects and object-to-object links, hence the apparent mismatch with the instance-centered modus operandi of an ANN. Yet given the strive for (proper) generalization in ANNs, the ontological structure, with its capacity to generalize along expert-validated conceptual hierarchies (and property ones, for that matter), is a natural ally.</p>
      <p>Nonetheless, a few studies have tackled the exploitation of generic knowledge from a DO in neural learning. For instance, in [13], the authors exploit a DO (a topic hierarchy, in actuality) of sound events to enhance a neural classifier. They propose to replicate the hierarchical structure of the DO in the ANN topology by: (1) allotting a layer per level in the is-a hierarchy and (2) enforcing fixed distance values between pairs of example embeddings, which roughly translate the examples’ topological distance within the hierarchy. In a similar vein, the method in [14] simulates the topology of the DO graph in learning the representations of its classes and properties. A class is thus reduced to the union of its data properties, those of its sub-classes and of related classes. In a following step, the method learns instance representations from the class representations in the DO, and uses them in behaviour prediction.</p>
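      <p>A rough sketch of the second ingredient, under our simplified reading of [13] (the helper and its arguments are hypothetical): a regularizer that pulls the embedding distance of two examples toward the distance separating their classes in the is-a hierarchy.</p>
      <preformat>
import torch

def hierarchy_distance_penalty(emb_a, emb_b, tree_distance, margin=0.0):
    """Regularizer pushing the embedding distance of two examples toward the
    (rescaled) distance separating their classes in the is-a hierarchy;
    meant to be added to the usual classification loss during training."""
    d = torch.norm(emb_a - emb_b, dim=-1)
    return torch.clamp((d - tree_distance).abs() - margin, min=0.0).mean()
</preformat>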
      <p>Besides, different ways of making ontologies and ANNs collaborate have been explored, e.g. ontology learning [15] or neural reasoning with ontologies [16]. For example, [17] approaches the latter task as a translation problem with noisy data.</p>
      <p>On a broader scope, while feature vector-oriented ANNs have shined on sequence- and grid-shaped data, i.e. with values arguably more important than the – highly regular – topology, graph data, due to its inherent sparsity, requires more fine-grained generalization (e.g. chemical functional groups, biological pathways, telecom network configurations, etc.). Graph Convolutional Neural Networks (GCNN) constitute a recent and promising approach for learning such regularities [18, 19]. By applying convolution layers on top of each other, they recursively aggregate k-th-order neighbourhood information from the graph and can achieve good generalization on such datasets. Yet due to their inherent bias toward frequent regularities, the very local, rare and context-specific ones will arguably be missed. And, clearly, this behaviour compounds whenever quality data prove scarce.</p>
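      <p>The recursive neighbourhood aggregation at the heart of GCNNs can be sketched as follows (a generic mean-aggregation message-passing layer, not any specific architecture from [18, 19]):</p>
      <preformat>
import numpy as np

def gnn_layer(features, adjacency, weight):
    """One message-passing layer: every vertex averages the feature vectors
    of its neighbours (and itself), then applies a learned linear map and a
    ReLU. Stacking k such layers aggregates k-th-order neighbourhood info."""
    n = adjacency.shape[0]
    adj_self = adjacency + np.eye(n)
    degrees = adj_self.sum(axis=1, keepdims=True)
    aggregated = (adj_self @ features) / degrees
    return np.maximum(aggregated @ weight, 0.0)
</preformat>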
      <p>Beyond pure generalization capabilities, dealing with actionable and surprising patterns mixing different abstraction levels is to be expected: conceptually, a sequentially layered generalization procedure might not prove enough to extract such regularities.</p>
      <p>Taking a step back, we consider three ongoing trends, each one following a founding principle: (i) K-IL supports the use of external domain knowledge as a way to bring improvements in both predictive power and explainability; (ii) G(C)NN approaches consider preserving topology as critical when working on graph data; (iii) contextual mechanisms lead to better results on both static (e.g. text translation) and dynamic (e.g. user behaviour) predictive tasks [20].</p>
      <p>To the best of our knowledge, no prior research has jointly addressed the above three concerns. Here, we present a novel approach for learning in complex domains that does so. By delegating most of the knowledge/pattern extraction effort to a dedicated symbolic method, we subsequently feed the extracted patterns as input features to an architecture-agnostic neural learner, thus offering ontologically-generalized, graph-shaped features that a priori overlap with a GNN’s convolved high-level patterns. Moreover, ontology-based generalization plays nice with robustness properties: by going beyond a mere boolean encoding of attributes (i.e. vertices, edges) with the help of a DO’s conceptual structure, it helps the symbolic learner avoid overfitting pitfalls.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Vision &amp; Approach details</title>
      <p>First, by bringing some ontological concepts into the data as higher-order regularities, we aim to make explicit the shared conceptual structure that remains invisible in the raw data. The rationale is that while exact values may mismatch, more abstract types describing those values would coincide. For instance, two groups of lactating cows may be treated for mastitis – a common bacterial infection of the udder – by using amoxicillin and penicillin, respectively. Now knowing that these are both β-lactams helps extend a common sub-graph comprising, at least, nodes for cow and mastitis, with a further node for that class of antibiotics. Obviously, this increase in the shared portions of the data graphs w.r.t. their raw versions would not be possible without an ontology covering the antibiotics. In a more general vein, inserting typing information and property generalizations helps reveal hidden commonalities that would not easily be spotted either by a human expert or by a sub-symbolic learner.</p>
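      <p>The following sketch illustrates that kind of ontological generalization on the antibiotics example: concrete vertex labels are lifted to a common ancestor class in the DO, so that two raw graphs differing on the drug still share a sub-graph. The tiny is-a fragment and helper names are ours, for illustration only.</p>
      <preformat>
# Tiny is-a fragment of a hypothetical DO, mapping a class to its parent.
IS_A = {
    "amoxicillin": "beta-lactam",
    "penicillin": "beta-lactam",
    "beta-lactam": "antibiotic",
}

def ancestors(label):
    """A label together with all of its ontological ancestors."""
    chain = [label]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def most_specific_common_class(label_a, label_b):
    """Most specific DO class subsuming both labels, or None."""
    candidates = set(ancestors(label_b))
    for c in ancestors(label_a):
        if c in candidates:
            return c
    return None

# Two cows treated with different drugs still share a 'beta-lactam' vertex.
print(most_specific_common_class("amoxicillin", "penicillin"))
</preformat>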
      <p>Next, our goal is to find all significant fragments of such shared structure in a set of data graphs. These include abstractions on both vertices (ontology classes) and edges (ontology properties). As an illustrative example, Figure 1 presents a possible pattern, illustrating possible causes for a shorter-than-average first lactation of a young cow. Here, frequently, both the young cow and its ancestor have been treated with different kinds of antibiotics.</p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>An example of ontological graph pattern.</p>
        </caption>
      </fig>
      <p>The resulting graph structures can be qualified as doubly-labelled multigraphs, i.e. labelled on both vertices and edges. Practically, we first discover the interesting patterns and then, in a feature engineering step, we assign them as higher-level descriptors of the matching data graphs.</p>
      <p>Another palpable advantage of using the ontology-based patterns is that they offer an integrated view of the shared structure: edges standing for properties connect class vertices, thus providing context to each of them. On the other hand, pattern components, as well as whole patterns, pertain to potentially varying abstraction levels.</p>
      <p>More concretely, Figure 2 details our hybrid strategy where graph patterns are first mined (step 3) and then fed (step 5) into a neural network (step 6), with graph data supported by a DO as our main input (steps 2 and 1, respectively), complementary to regular tabular data (step 4). The mining of ontological patterns from that graph data uses the domain ontology as a backbone for the exploration (e.g. ontological types, resources as vertices and properties as edges). Between steps 3 and 5, an optional post-processing step can further refine the patterns to emphasize contrasts, or synthesize them by approximating, if required by the learning task. The resulting ontological graph patterns allow the original data to be encoded with the new features supported by domain knowledge before feeding the augmented data to the ANN (steps 5 and 6).</p>
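      <p>Steps 5 and 6 then amount to a simple feature-engineering scheme, sketched below under the simplifying assumption that pattern matching reduces to a boolean occurrence test: each data graph becomes a binary vector indexed by the mined ontological patterns, which any off-the-shelf neural learner can consume.</p>
      <preformat>
import numpy as np

def encode_with_patterns(graphs, patterns, matches):
    """Step 5 (sketch): one row per data graph, one boolean column per mined
    ontological pattern; matches(pattern, graph) stands for the ontology-aware
    sub-graph isomorphism test discussed below."""
    X = np.zeros((len(graphs), len(patterns)))
    for i, g in enumerate(graphs):
        for j, p in enumerate(patterns):
            if matches(p, g):
                X[i, j] = 1.0
    return X
</preformat>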
      <p>Pattern mining [21] aims at extracting recurrent data fragments, a.k.a. patterns, capturing the most relevant information possible. A mining task is defined by a pair of languages, one for data records and one for patterns, and a relevance (interestingness) criterion. The typical criterion is frequency of appearance, but other criteria such as utility or some domain-related ones are possible. Moreover, an effective mining method requires a general strategy for pattern space traversal and a technique to perform pattern-to-data-record matching. The latter revolves around computing a variation of sub-graph isomorphism, here integrating the conceptual structure of an ontology. Typically, the former entails defining a spanning tree of the pattern space and a canonical representation of graph patterns to avoid generating multiple copies of the same pattern [22].</p>
      <p>Ontologies have been used in frequent pattern mining to guide the exploration of complex pattern spaces, such as sequences of objects or simple graphs, for some time [23, 24]. For predictive tasks exploiting graph pattern mining, a few successful techniques exist, such as quantitative structure-activity relationships (QSARs) [25], optimizing objective functions [26] or dedicated pattern ranking metrics [27] exploiting external domain knowledge. While ontologies and patterns have been combined before, to the best of our knowledge, no mining method has targeted data of such complexity.</p>
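      <p>The ontology-aware twist on the sub-graph isomorphism test mainly concerns label compatibility: a pattern vertex carrying a class matches a data vertex whose type is that class or one of its descendants (and similarly for property labels on edges). A minimal sketch of that compatibility check, reusing an is-a map as above:</p>
      <preformat>
def subsumes(is_a, general, specific):
    """True if `general` is `specific` itself or one of its ancestors
    (is_a maps a class to its parent, as in the sketch above)."""
    current = specific
    while True:
        if current == general:
            return True
        if current not in is_a:
            return False
        current = is_a[current]

def vertex_compatible(pattern_label, data_label, is_a):
    # A pattern vertex labelled 'antibiotic' matches a data vertex labelled
    # 'amoxicillin' whenever the ontology says the latter is-a the former.
    return subsumes(is_a, pattern_label, data_label)
</preformat>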
      <p>The downside of the approach is its sensitivity to the pattern frequency threshold and the related potential combinatorial explosion in the result. While this is a serious cost issue with graph patterns, possible mitigation strategies exist, e.g. using condensed representations thereof such as closed patterns [28].</p>
      <p>Overall, the expected immediate benefits of the ontological knowledge injection into the neural learning process include higher accuracy of the predictive architecture and faster convergence.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>1</label><mixed-citation>T. Gruber, et al., A translation approach to portable ontology specifications, Knowledge Acquisition 5 (1993) 199–220.</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>F. Kramer, T. Beißbarth, Working with ontologies, in: Bioinformatics, Springer, 2017, pp. 123–135.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>A. Barbosa, et al., Modeling yield response to crop management using convolutional neural networks, Computers and Electronics in Agriculture 170 (2020) 105197.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>C. Frasco, et al., Towards an Effective Decision-making System based on Cow Profitability using Deep Learning, in: Proc. of the 12th ICAART, Valletta, Malta, 2020, pp. 949–958.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>A. Sheth, et al., Shades of knowledge-infused learning for enhancing deep learning, IEEE Internet Computing 23 (2019) 54–63.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>Y. Bengio, et al., Representation learning: A review and new perspectives, IEEE Transactions on PAMI 35 (2013) 1798–1828.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>D. Dou, et al., Semantic data mining: A survey of ontology-based approaches, in: IEEE ICSC 2015, IEEE, 2015, pp. 244–251.</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>Q. Wang, et al., Knowledge graph embedding: A survey of approaches and applications, IEEE TKDE 29 (2017) 2724–2743.</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>A. Bordes, et al., Translating embeddings for modeling multi-relational data, in: Advances in Neural Information Processing Systems, 2013, pp. 2787–2795.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>A. Bordes, et al., A semantic matching energy function for learning with multi-relational data: Application to word-sense disambiguation, Machine Learning 94 (2014) 233–259.</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>P. Ristoski, H. Paulheim, RDF2vec: RDF graph embeddings for data mining, in: ISWC, Springer, 2016, pp. 498–514.</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>C. Yin, et al., Domain knowledge guided deep learning with electronic health records, in: 2019 IEEE ICDM, IEEE, 2019, pp. 738–747.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>A. Jiménez, et al., Sound event classification using ontology-based neural networks, in: Proc. of the Annual Conference on NeurIPS, 2018.</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>N. Phan, et al., Ontology-based deep learning for human behavior prediction with explanations in health social networks, Information Sciences 384 (2017) 298–313.</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>M. Casteleiro, et al., Ontology learning with deep learning: a case study on patient safety using PubMed, in: SWAT4LS, 2016.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>P. Hohenecker, T. Lukasiewicz, Deep learning for ontology reasoning, CoRR (2017).</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>B. Makni, J. Hendler, Deep learning for noise-tolerant RDFS reasoning, Semantic Web 10 (2019) 823–862.</mixed-citation></ref>
      <ref id="ref18"><label>18</label><mixed-citation>K. Xu, et al., How powerful are graph neural networks?, arXiv:1810.00826 [cs, stat] (2019). URL: http://arxiv.org/abs/1810.00826.</mixed-citation></ref>
      <ref id="ref19"><label>19</label><mixed-citation>H. Yuan, et al., XGNN: Towards model-level explanations of graph neural networks, in: Proc. of the 26th ACM SIGKDD, 2020, pp. 430–438.</mixed-citation></ref>
      <ref id="ref20"><label>20</label><mixed-citation>A. Vaswani, et al., Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.</mixed-citation></ref>
      <ref id="ref21"><label>21</label><mixed-citation>C. Aggarwal, J. Han, Frequent Pattern Mining, 2014 ed., Springer, 2014.</mixed-citation></ref>
      <ref id="ref22"><label>22</label><mixed-citation>X. Yan, J. Han, gSpan: Graph-based substructure pattern mining, in: Proc. of the IEEE ICDM 2002, IEEE, 2002, pp. 721–724.</mixed-citation></ref>
      <ref id="ref23"><label>23</label><mixed-citation>M. Adda, et al., A framework for mining meaningful usage patterns within a semantically enhanced web portal, in: Proc. of the 3rd C* CCSE, 2010, pp. 138–147.</mixed-citation></ref>
      <ref id="ref24"><label>24</label><mixed-citation>A. Cakmak, G. Ozsoyoglu, Taxonomy-superimposed graph mining, in: Proc. of the 11th EDBT, ACM, 2008, pp. 217–228.</mixed-citation></ref>
      <ref id="ref25"><label>25</label><mixed-citation>S. Nijssen, J. Kok, Frequent graph mining and its application to molecular databases, in: IEEE Transactions on Systems, Man and Cybernetics, volume 5, IEEE, 2004, pp. 4571–4577.</mixed-citation></ref>
      <ref id="ref26"><label>26</label><mixed-citation>H. Saigo, et al., gBoost: a mathematical programming approach to graph classification and regression, Machine Learning 75 (2009) 69–89.</mixed-citation></ref>
      <ref id="ref27"><label>27</label><mixed-citation>E. Spyropoulou, et al., Mining interesting patterns in multi-relational data with n-ary relationships, in: International Conference on Discovery Science, Springer, 2013, pp. 217–232.</mixed-citation></ref>
      <ref id="ref28"><label>28</label><mixed-citation>X. Yan, J. Han, CloseGraph: mining closed frequent graph patterns, in: Proc. of the 9th ACM SIGKDD, ACM, 2003, pp. 286–295.</mixed-citation></ref>
    </ref-list>
  </back>
</article>