Inducing a Decision Tree with Discriminative Paths to Classify Entities in a Knowledge Graph

Gilles Vandewiele, Bram Steenwinckel, Femke Ongenae, and Filip De Turck

IDLab, Ghent University – imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
{firstname}.{lastname}@ugent.be

Abstract. Deep-learning based techniques are increasingly being used for different machine learning tasks on knowledge graphs. While it has been shown empirically that these techniques often achieve better predictive performances than their classical counterparts, in which features are extracted from the graph, they lack interpretability. Interpretability is a vital aspect in critical domains such as the health and financial sector. In this paper, we present a technique that builds a decision tree of class-specific substructures in order to classify different entities within the knowledge graph. We show that our proposed technique is competitive with current state-of-the-art deep-learning techniques on four benchmark datasets, while being fully interpretable.

Keywords: Knowledge Graphs · White-Box Machine Learning · Semantic Data Mining

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Graphs are data structures that are useful to represent ubiquitous phenomena such as social networks, biological protein reactions and recommendation systems. One of their strengths lies in the fact that they add an extra dimension to the data by explicitly modeling interactions, in the form of edges, between individual units (i.e. nodes) [6]. These graphs can also be used to represent knowledge bases, which are repositories of domain or expert knowledge. Today, these graphs are increasingly being leveraged for various machine learning tasks [23]. One of these tasks is to classify nodes into one out of a set of discrete classes, which is the focus of this study.

Different types of approaches can be identified to classify nodes in a knowledge graph. A first group of approaches are classical ones. Here, information about the structure of the graph is explicitly encoded into a feature vector, which can then be fed to a machine learning model [9]. Examples of such features are indications of the presence of specific local neighbourhood structures [10] and graph statistics [1]. The disadvantage of this type of approach is that it is not agnostic: the features need to be tailored specifically to the task and application domain at hand. Another popular classical approach, which is more task-agnostic, is applying kernel methods [20], which measure the similarity between two knowledge bases, either directly on their graph representation [11,21,22] or based on description logics [4]. Unfortunately, using pairwise similarity measures as features is often a lot less interpretable than using human-understandable variables.

A second type of approach, which has been gaining immensely in popularity, is representation learning. The goal here is to create a mapping from the graph-based structures onto low-dimensional numerical vectors that can be used for downstream machine learning tasks [5]. These vectors can be created through tensor factorization [13], or by applying unsupervised deep-learning techniques, such as Word2Vec [12], on walks extracted from the graph [14,2,16]. Representation learning can be seen as completely task-agnostic; the learned representations can even be re-used for multiple tasks. Moreover, these techniques often tend to achieve higher predictive performances than, for example, their kernel or classical feature-based counterparts. The disadvantage of these approaches is that by mapping an entity onto a low-dimensional latent representation, all interpretability is lost. A final and very recent type of approach are graph networks, which are adaptations of neural networks that can work directly on graph-based data [3,8,18]. Again, this type of technique can be seen as a black box, making it very hard or even impossible to extract any insights from the model.

In this paper, we present a technique that can classify unseen entities or nodes of a knowledge graph (KG), given some already labeled entities. It does this by building a decision tree of useful substructures, extracted from the neighborhoods of the labeled entities, that are highly discriminative for a certain class. The technique is domain-agnostic and results in a white-box model that uses interpretable features to classify new, unseen nodes.

2 Methodology

In this section, we first explain some fundamental concepts, followed by an elaboration of the different steps of our algorithm.

2.1 Entity Classification: Problem Definition

We are given a multi-relational directed knowledge graph G = (V, E, ℓ), constructed from a knowledge base of triples, where V are the vertices or entities in our graph, E the edges or predicates, and ℓ a labeling function that maps each vertex or edge onto its corresponding label. Moreover, we are provided with a list of entities V together with a corresponding vector of discrete labels y. Our goal is to construct a model or hypothesis h(·), based on V and y, that minimizes a loss function L(·) and generalizes well to unseen vertices:

    arg min_h L(y, h(V))

2.2 Neighborhoods, Walks and Wildcards

We characterize each instance v ∈ V by its neighborhood N(v) of a certain depth d. The neighborhood of v is a graph that contains all vertices that can be reached in d or fewer steps from the instance v. It can easily be extracted by performing a breadth-first traversal.

As done by de Vries et al. [22], we first transform our graph to remove its multi-relational aspect. To do this, each (subject, predicate, object) triple from the original knowledge base is represented by three labeled nodes and two unlabeled edges (subject → predicate and predicate → object). This transformation reduces the complexity of the procedures elaborated below, without a loss of correctness, since a distinction between entities and predicates is no longer needed.
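As a minimal sketch of this transformation (assuming triples arrive as plain string tuples; the helper name triples_to_graph, the use of networkx, and the choice to create a fresh predicate node per triple are our own illustrative assumptions, not part of the paper):

```python
import networkx as nx

def triples_to_graph(triples):
    """Represent each (subject, predicate, object) triple as three
    labeled nodes and two unlabeled edges: s -> p and p -> o."""
    g = nx.DiGraph()
    for i, (s, p, o) in enumerate(triples):
        # A fresh predicate node per triple, so that different triples
        # sharing a predicate label are not merged (an assumption; the
        # paper does not spell out this detail).
        p_node = (p, i)
        g.add_node(s, label=s)
        g.add_node(p_node, label=p)
        g.add_node(o, label=o)
        g.add_edge(s, p_node)
        g.add_edge(p_node, o)
    return g

g = triples_to_graph([
    ('ex:Alice', 'rdf:type', 'ex:Person'),
    ('ex:Alice', 'ex:knows', 'ex:Bob'),
])
```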
We define a walk as a sequence of vertices. Due to the transformation discussed above, this sequence consists of consecutive vertices and edges from the original graph. The first vertex of a walk is often called its root; it serves as a placeholder that is replaced by a specific vertex depending on the context. We denote a walk as root → e0 → v1 → e1 → ...

We introduce a new, special type of hop for our walks, which we call a ‘wildcard’ and denote by an asterisk (∗). The semantics of this wildcard is that any edge or vertex label can be matched at that position in the sequence. This gives the walks more expressive power. To illustrate this, imagine that the presence of an entity of a specific type T is very discriminative for a certain class. It is possible that only the fact that this entity is of that type carries information, while the specifics of the entity itself are unimportant. Such a pattern can be represented by the walk root → ∗ → ∗ → rdf:type → T.

2.3 Inducing a Decision Tree of Discriminative Walks

In this study, we focus on a special type of walk. A walk of length l has a root placeholder, followed by l − 2 wildcards, and ends in a specific vertex u, i.e. root → ∗ → ... → ∗ → u. As mentioned, the first hop, root, is replaced by an instance v whenever we want to search for the walk in its neighborhood N(v). Alternatively, we can simply represent these walks by a tuple w = (u, l). Walks of this type can be searched for in a neighborhood in constant time, while already possessing a rich amount of expressive power, as we demonstrate empirically in Section 3. When extracting a neighborhood of depth d, we keep track of d different sets {N^i(v) | 1 ≤ i ≤ d}, where N^i(v) stores the nodes that can be reached in exactly i hops. Whenever we want to search for such a walk w = (u, l) in a neighborhood, we only need to check whether u appears in N^l(v), thus avoiding the need to traverse parts of the graph. This allows us to quickly perform a brute-force search over all combinations of vertices u and lengths l, up to a specified maximum.
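A minimal sketch of this lookup structure, continuing from the graph g built above (depth_sets and contains_walk are illustrative names; we read "reached in exactly i hops" as reachable by some directed walk of length i, which is an assumption on our part):

```python
def depth_sets(g, root, d):
    """Return {i: labels of nodes reachable in exactly i hops from
    `root`}, for 1 <= i <= d, via a breadth-first traversal."""
    sets = {}
    frontier = {root}
    for i in range(1, d + 1):
        frontier = {nb for node in frontier for nb in g.successors(node)}
        sets[i] = {g.nodes[nb]['label'] for nb in frontier}
    return sets

def contains_walk(sets, walk):
    """A walk w = (u, l), i.e. root -> * -> ... -> * -> u, is present
    iff label u occurs at depth exactly l: a constant-time lookup."""
    u, l = walk
    return u in sets.get(l, set())

sets = depth_sets(g, 'ex:Alice', d=8)
print(contains_walk(sets, ('ex:Person', 2)))  # True: Alice -> rdf:type -> ex:Person
```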
Our goal is to mine the walk w = (u, l) that maximizes information gain, which is defined as the (weighted) reduction in entropy obtained by partitioning the data. To calculate this, we partition our set of nodes into two mutually exclusive sets: a set of nodes for which the walk can be found and a set of nodes for which the walk cannot be found. Afterwards, we measure the entropy of the corresponding labels in these two sets and compare it with the entropy before the split.
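A minimal sketch of this splitting criterion (assuming labels are hashable class identifiers; the function names are ours):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, mask):
    """Weighted reduction in entropy when splitting `labels` by the
    boolean `mask` (True = the walk is found in that node's
    neighborhood, False = it is not)."""
    pos = [y for y, m in zip(labels, mask) if m]
    neg = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    remainder = (len(pos) / n) * entropy(pos) + (len(neg) / n) * entropy(neg)
    return entropy(labels) - remainder
```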
Often, one walk is not enough to create a perfect separation between the different classes in the feature space, especially when dealing with a multi-class problem. Therefore, we recursively build a decision tree: after mining the most discriminative walk, we partition our data into a set of nodes for which the walk can be found and a set for which it cannot be found, and repeat the procedure on both subsets. While decision trees possess excellent interpretability characteristics, they can be prone to overfitting [17]. Therefore, two hyper-parameters that allow for pre-pruning, i.e. conditions on which the algorithm halts, are introduced. On the one hand, the algorithm halts when a certain depth (max_depth) is reached. On the other hand, the algorithm stops when the number of samples in a particular node of the decision tree is lower than a specified amount (min_samples_leaf).
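Putting the pieces together, the following condensed sketch shows one possible shape of the recursive induction with both pre-pruning conditions. It reuses contains_walk and information_gain from the sketches above; the Node class and all names are illustrative assumptions rather than the paper's actual implementation:

```python
from collections import Counter

class Node:
    """Internal nodes test a walk; leaves carry a predicted class."""
    def __init__(self, walk=None, label=None):
        self.walk, self.label = walk, label
        self.found, self.not_found = None, None

def induce(instances, labels, candidates, depth=0,
           max_depth=8, min_samples_leaf=1):
    """instances: one depth-set dict per training node (see depth_sets);
    candidates: a non-empty list of (u, l) walks to brute-force over.
    Pre-pruning via max_depth and min_samples_leaf, as described above."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(labels) <= min_samples_leaf \
            or len(set(labels)) == 1:
        return Node(label=majority)
    best = max(candidates, key=lambda w: information_gain(
        labels, [contains_walk(s, w) for s in instances]))
    mask = [contains_walk(s, best) for s in instances]
    if all(mask) or not any(mask):   # no walk separates the data
        return Node(label=majority)
    node = Node(walk=best)
    node.found = induce(
        [s for s, m in zip(instances, mask) if m],
        [y for y, m in zip(labels, mask) if m],
        candidates, depth + 1, max_depth, min_samples_leaf)
    node.not_found = induce(
        [s for s, m in zip(instances, mask) if not m],
        [y for y, m in zip(labels, mask) if not m],
        candidates, depth + 1, max_depth, min_samples_leaf)
    return node
```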
3 Results

In this section, we demonstrate the predictive power of our proposed approach by comparing its accuracy on benchmark datasets with the accuracy scores of two recent state-of-the-art techniques: rdf2vec [14] and Relational Graph Convolutional Networks (r-gcn). We extracted four datasets, from varying domains, from a public repository set up by Ristoski et al. [15]. For each of the benchmark datasets, we use the same train-test partitioning as provided by the original repository. For all classifiers, no pre-pruning was applied and trees were thus grown until the training set was perfectly classified. The neighborhood depth and the maximum length of the extracted walks were both set to 8, as was done in the study of Ristoski et al. [14]. For each dataset, we performed 5 runs. The average accuracy scores achieved on the test set and their corresponding standard deviations are summarized in Table 1. The results for r-gcn and rdf2vec are taken directly from Schlichtkrull et al. [19]. As can be seen, the performances of all three techniques are competitive with each other. Only on the AM dataset does our technique perform slightly worse than both others, while on the MUTAG dataset our technique outperforms the two others, albeit with a higher variance.

Table 1. Accuracy scores (mean ± standard deviation) of Relational Graph Convolutional Networks (r-gcn), rdf2vec and our proposed approach (Walk Tree) on four benchmark datasets.

Dataset   r-gcn          rdf2vec        Walk Tree
AIFB      95.83 ± 0.62   88.88 ± 0.00   89.44 ± 2.08
BGS       83.10 ± 0.80   87.24 ± 0.89   86.90 ± 1.38
MUTAG     73.23 ± 0.48   67.20 ± 1.24   73.82 ± 5.61
AM        89.29 ± 0.35   88.33 ± 0.61   86.77 ± 0.59

4 Conclusion and Future Work

In this study, we proposed to build a decision tree of discriminative walks in a recursive fashion. The advantage of this approach is that it is fully interpretable: the model, or at least the path to a certain prediction, can easily be visualised for a domain expert. While our implementation is a first proof of concept, the results are already very promising, as the technique is competitive with the current state-of-the-art on the four tested benchmark datasets. It even outperforms current techniques on one of the four datasets. Nevertheless, potential improvements can still be made. Examples of possible extensions are a post-pruning phase and experimenting with splitting criteria other than information gain. Moreover, ensemble techniques can be used to construct a collection of different decision trees in order to reduce the model variance. One straightforward ensembling technique is bagging, where we induce multiple decision trees on subsets of training instances, which are sampled from the original training set through bootstrapping. Another alternative, which has proven very successful in the domain of time series classification [7], is to step away from mining these discriminative substructures one by one in a recursive fashion, and instead mine a large collection of substructures and then create feature vectors based on the presence or absence of these substructures. Finally, it should be noted that the concept of a walk is not a necessity in the proposed approach, as only the last hop of a walk determines whether or not it can be found in a neighborhood. Nevertheless, the walk representation is more intuitive and easily allows for further extensions of this data structure, such as heuristically filling in wildcards on the walk to try and improve the information gain, or plugging in more complex data structures than single vertices (e.g. trees and subgraphs).

Code availability

An implementation of the algorithm is available on GitHub (https://github.com/IBCNServices/KGPTree).

Acknowledgements

Gilles Vandewiele (1S31417N) and Bram Steenwinckel (1SA0219N) are both funded by a strategic base research grant of Fonds Wetenschappelijk Onderzoek (FWO). Femke Ongenae is funded by a Bijzonder Onderzoeksfonds (BOF) grant of Ghent University.

References

1. Bhagat, S., Cormode, G., Muthukrishnan, S.: Node classification in social networks. In: Social Network Data Analytics (2011)
2. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global RDF vector space embeddings. In: International Semantic Web Conference. pp. 190–207. Springer (2017)
3. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems. pp. 3844–3852 (2016)
4. Fanizzi, N., d'Amato, C., Esposito, F.: Statistical learning for inductive query answering on OWL ontologies. In: International Semantic Web Conference. pp. 195–212. Springer (2008)
5. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems (2017)
6. Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin (2017)
7. Hills, J., Lines, J., Baranauskas, E., Mapp, J., Bagnall, A.: Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery 28(4), 851–881 (2014)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
9. Latouche, P., Rossi, F.: Graphs in machine learning: an introduction. In: ESANN. pp. 207–218 (2015)
10. Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM). pp. 556–559 (2003)
11. Lösch, U., Bloehdorn, S., Rettinger, A.: Graph kernels for RDF data. In: Extended Semantic Web Conference. pp. 134–148. Springer (2012)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1), 11–33 (2015)
14. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: International Semantic Web Conference. pp. 498–514. Springer (2016)
15. Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference. pp. 186–194. Springer (2016)
16. Saeed, M.R., Prasanna, V.K.: Extracting entity-specific substructures for RDF graph embedding. In: 2018 IEEE International Conference on Information Reuse and Integration (IRI). pp. 378–385. IEEE (2018)
17. Schaffer, C.: When does overfitting decrease prediction accuracy in induced decision trees and rule sets? In: European Working Session on Learning. pp. 192–205. Springer (1991)
18. Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. ESWC (2017)
19. Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: European Semantic Web Conference. pp. 593–607. Springer (2018)
20. Vishwanathan, S., Schraudolph, N., Kondor, R., Borgwardt, K.: Graph kernels. Journal of Machine Learning Research 11(Apr), 1201–1242 (2010)
21. de Vries, G.K.: A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 606–621. Springer (2013)
22. de Vries, G.K.D., de Rooij, S.: Substructure counting graph kernels for machine learning from RDF data. Web Semantics: Science, Services and Agents on the World Wide Web 35, 71–84 (2015)
23. Wilcke, X., Bloem, P., De Boer, V.: The knowledge graph as the default data model for machine learning. Data Science 1 (2017)