=Paper=
{{Paper
|id=Vol-2925/paper5
|storemode=property
|title=Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study

|pdfUrl=https://ceur-ws.org/Vol-2925/paper5.pdf
|volume=Vol-2925
|authors=Pierre Monnin,Emmanuel Bresso,Miguel Couceiro,Malika Smaïl-Tabbone,Amedeo Napoli,Adrien Coulet
|dblpUrl=https://dblp.org/rec/conf/algos/MonninBCSNC20
}}
==Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study
==
<pdf width="1500px">https://ceur-ws.org/Vol-2925/paper5.pdf</pdf>
<pre>
        TACKLING SCALABILITY ISSUES IN MINING PATH PATTERNS
          FROM KNOWLEDGE GRAPHS : A PRELIMINARY STUDY ⇤


                         Pierre Monnin                                                 Emmanuel Bresso
          Université de Lorraine, CNRS, Inria, LORIA                      Université de Lorraine, CNRS, Inria, LORIA
                     F-54000 Nancy, France                                           F-54000 Nancy, France
                 pierre.monnin@loria.fr                                         emmanuel.bresso@loria.fr

                        Miguel Couceiro                                             Malika Smaïl-Tabbone
          Université de Lorraine, CNRS, Inria, LORIA                      Université de Lorraine, CNRS, Inria, LORIA
                     F-54000 Nancy, France                                           F-54000 Nancy, France
                miguel.couceiro@loria.fr                                          malika.smail@loria.fr

                        Amedeo Napoli                                                    Adrien Coulet
          Université de Lorraine, CNRS, Inria, LORIA                      Université de Lorraine, CNRS, Inria, LORIA
                     F-54000 Nancy, France                                           F-54000 Nancy, France
                 amedeo.napoli@loria.fr                                          adrien.coulet@loria.fr


                                                           A BSTRACT
           Features mined from knowledge graphs are widely used within multiple knowledge discovery tasks
           such as classification or fact-checking. Here, we consider a given set of vertices, called seed vertices,
           and focus on mining their associated neighboring vertices, paths, and, more generally, path patterns
           that involve classes of ontologies linked with knowledge graphs. Due to the combinatorial nature and
           the increasing size of real-world knowledge graphs, the task of mining these patterns immediately
           entails scalability issues. In this paper, we address these issues by proposing a pattern mining
           approach that relies on a set of constraints (e.g., support or degree thresholds) and the monotonicity
           property. As our motivation comes from the mining of real-world knowledge graphs, we illustrate
           our approach with PGxLOD, a biomedical knowledge graph.

Keywords Path · Path Pattern · Ontology · Knowledge Graph · Scalability

1       Introduction
Knowledge graphs [1] have a central role in knowledge discovery tasks. For example, Linked Open Data [2] have
been used in all steps of the knowledge discovery process [3]. In particular, features mined from knowledge graphs
have been used in multiple applications such as knowledge base completion [4], explanations [5], or fact-checking [6].
Here, we focus on knowledge graphs expressed using Semantic Web standards [7]. In this context, vertices are either
individuals that represent entities of a world (e.g., places, drugs, etc.), literals (e.g., integers, dates, etc.), or classes
of individuals (e.g., Person, Drug, etc.). Arcs are defined by triples hsubject, predicate, objecti in the Resource
Description Format language. Such a triple states that the subject is linked to the object by a relationship qualified
by the predicate (e.g., has-side-effect, has-name, etc.). Classes and predicates are defined in ontologies, i.e.,
formal representations of a domain [8], and organized into two hierarchies ordered by the subsumption relation. In
Semantic Web standards, individuals, classes, and predicates are identified by a Uniform Resource Identifier (URI). We
view such a knowledge graph as a directed labeled multigraph K = (⌃V , ⌃A , V, A, s, t, `V , `A ), where
    ⇤
    Supported by the PractiKPharma project, founded by the French National Research Agency (ANR) under Grant ANR15-CE23-
0028, and by the Snowball Inria Associate Team.
 Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
         • V is the set of vertices.
         • A is the set of arcs connecting vertices through predicates2 .
         • ⌃V is the set of vertex labels, here, their URI3 .
         • ⌃A is the set of arc labels, here, URIs of predicates of K.
         • s : A ! V (respectively t : A ! V ) associates an arc to its source (respectively target) vertex.
         • `V : V ! ⌃V (respectively `A : A ! ⌃A ) maps a vertex (respectively an arc) to its label.
Hence, a triple hs, p, oi is represented by two vertices vs , vo 2 V and an arc ahs,p,oi 2 A. The source and target vertices
of ahs,p,oi are respectively vs and vo , i.e., s(ahs,p,oi ) = vs and t(ahs,p,oi ) = vo . The labels of vs , vo , and ahs,p,oi are
respectively s, o, and p, i.e., `V (vs ) = s, `V (vo ) = o, and `A (ahs,p,oi ) = p.
In this work, we consider the task of mining features from K that are associated with a set of vertices of interest, which
we call seed vertices. The set of seed vertices can be defined in intension (i.e., all vertices that instantiate a specified
ontology class) or in extension (i.e., by specifying the list of their URIs). For example, in the biomedical domain, an
expert may be interested in mining features associated with vertices that represent drugs causing a specific side effect.
We propose to mine from K the three following kinds of features: neighboring vertices, paths, and path patterns.
Neighboring vertices are vertices that can be reached in K from at least one seed vertex. A neighbor is associated with
all seed vertices from which it is reachable. Its support counts such seed vertices. For example, in the knowledge graph
depicted in Figure 1, the neighbor v6 is reachable from the seed vertices nC 1 and n2 , and thus its support is 2.
                                                                                      C

                                  p
Paths are sequences of pairs ! e that represent an arc labeled by the predicate p incident to an individual e. A path is
associated with all seed vertices that root it in K. The support of a path counts such seed vertices. For example, the
            p1     p2                                        C p1    p2
support of ! v2 ! v3 is 1 since only nC    1 root it, i.e., n1 ! v2 ! v3 exists in K.
More generally, paths may share several characteristics. For instance, intermediate vertices in paths may instantiate the
same ontology classes. We propose to capture these characteristics by considering path patterns in addition to paths.
                                       p
Path patterns are sequences of pairs ! E, where p is a predicate and E is either an individual or a class. Such a pair
indicates that an arc labeled by p is incident to (i) E if E is an individual or (ii) an individual that instantiates E if E
is a class. A path pattern is associated with all seed vertices that root a path captured by the path pattern. Its support
counts such seed vertices. In the example graph depicted in Figure 1, v2 instantiates T1 , and v3 instantiates T2 . Thus,
 p1      p2                     p1      p2      p1      p2           p1      p2
 ! v2 ! v3 is captured by ! T1 ! v3 , ! v2 ! T2 , and ! T1 ! T2 . Since T2 is a subclass of T3 , it is also
                          p1       p2                 p1      p2                     p1      p2
captured by the pattern ! T1 ! T3 . Note that ! T1 ! T3 also captures ! v4 ! v5 , which is rooted by nC                    2.
                                 p1      p2
Consequently, the support of ! T1 ! T3 is 2. This illustrates the fact that path patterns may capture additional
common characteristics of seed vertices, and thus interestingly complete paths. Mining these patterns constitutes a
challenging task due to the combinatorial nature and the size of real-world knowledge graphs, which naturally entail
                                   p1      p2
scalability issues. For example, ! v2 ! v3 can be generalized by up to 11 path patterns.
This mining task and its inherent scalability issues constitute the main concerns of the present work. To the best of
our knowledge, works available in the literature do not address such issues in knowledge graphs with the adopted
granular modeling of path patterns. However, inspired by existing graph mining works [9], we propose an Apriori-based
approach that alleviates these scalability issues by relying on (i) a set of constraints (e.g., support or degree thresholds),
(ii) the hierarchy of ontology classes, (iii) an incremental expansion of paths and patterns, and (iv) the monotonic
character of the support of paths and patterns. We provide a reusable implementation on GitHub4 .
The remainder of this paper is organized as follows. In Section 2, we outline related works that motivated our proposed
approach. In Section 3, we present in details our approach to mine paths and path patterns, and discuss how it tackles
scalability issues. We illustrate our framework in Section 4 on PGxLOD, a real-world biomedical knowledge graph [10].
We comment on our results as well as indicate directions of future work in Sections 5 and 6.

2       Related work
Path patterns have been widely studied in different settings, for example graph rewriting [11] and query answering [12].
Here, we recall some works that tackle the problem of mining path patterns from knowledge graphs in different
application contexts.
    2
      Here, we discard literals from V and arcs that are incident to literals from A.
    3
      Hence, |⌃V | = |V |.
    4
      https://github.com/pmonnin/kgpm
                                                               p1                    p2
                                                  nC
                                                   1
                                                                         v2                  v3

                                                   p6
                                                                                          type
                                                          v8
           ...                      p4                               type                    T2         p3
                                                        type

           ...                               T5                     subClassOf        subClassOf
                                   type

           ...      ...                                         subClassOf
                            v1                            T1                     >                 T3         v6
                                                                                      subClassOf
                                   type
           ...                               T6                                       subClassOf
                                                                    subClassOf
                                                        type
           ...                      p4                               type                    T4         p3
                                                          v9
                                                   p6                                     type

                                      p5
                            v7                    nC
                                                   2
                                                                         v4                  v5
                                                               p1                    p2


Figure 1: Example of a canonical graph KC . nC 1 and n2 are canonical seed vertices, all vi are canonical individuals,
                                                        C

and all Ti are canonical ontology classes. Prefixes of URIs were omitted for readability purposes. The definition of
“canonical” is given in Subsection 3.1.


One major work dealing with feature mining from RDF graphs focuses on graph kernels that count common substructures
(i.e., walks, subtrees) [13, 14]. To avoid an explosion in the number of features, the authors remove patterns with a
low or high frequency [14]. Graph features can be used in various tasks such as knowledge base completion. For
example, AMIE [15] mines Horn clauses, i.e., conjunction of triples, to predict another triple. Similarly, in Context
Path Model [16], the authors model paths as sequences of predicates between the source and target entities, i.e.,
   p1 p2       pk 1 pk
s ! ! ···         ! ! t. Here, a triple is predicted based on the paths existing between the involved entities. Shi and
Weninger [6] also model paths as sequences of predicates but they use them from a fact checking perspective. They check
                    p                                                                           p1 p2       pk 1 pk
whether a triple s ! t is true by predicting it from a set of learned discriminative paths os ! ! · · ·        ! ! ot ,
where os and ot are respectively the set of classes instantiated by s and t. Our framework differs from the previous two
as we aim at mining features by exploring the graph from the given seed vertices and we model path patterns using the
intermediate entities and the ontology classes that they instantiate.
Our motivation also comes from explainable approaches that rely on the descriptive power of features mined from
knowledge graphs. For example, Explain-a-LOD [5] enriches statistical data sets with features from DBpedia. When
correlations can be established between statistics and DBpedia features, these features can be used as explanations for the
original statistics. For example, the quality of living in cities has been correlated with whether these cities are European
capitals. Explain-a-LOD leverages the different outputs of FeGeLOD [17], some of which corresponding to our
                                                   p                                                          p
approach. For instance, the so called relations ! e are paths, whereas the so called qualified relations ! t, where e is
replaced by a class t instantiated by e, are path patterns. Alternatively, Vandewiele et al. [18] propose to learn a decision
tree to classify entities based on paths of a knowledge graph. The authors suggest that the predictions of their model are
explainable as they are obtained by a “white-box” model (i.e., the decision tree) that combines interpretable features (i.e.,
paths from a knowledge graph). Interestingly, this system considers paths with their intermediate predicates and entities,
            p1          pk
i.e., root ! e1 · · · ! ek . They allow a generalization of both predicates and entities by the use of a wildcard (*). In
our context, this would correspond to generalizing entities by the top level ontology class >. However, unlike their
                                                                                                        *         *
framework, we do not generalize predicates. In their study, they focus on paths of the form root ! * · · · ! e, which
somewhat corresponds to extracting neighbors and their distance from seed vertices.
Finally, path patterns are somewhat similar to generalized association rules [19] and the concept of raising [20]. Indeed,
both works replace entities in rules by ontology classes to increase the support while preserving a high confidence.
Inspired by works that prune redundant generalized rules [21], our approach mines paths and path patterns that are
non-redundant and comply with some given constraints.
                                                                                         4. Optional and
                                                    1. Canonicalizing K
                                                                                     domain-dependent filtering

              Knowledge graph K
                                                   2. Mining interesting                    Binary matrix
                                                   neighbors and types                      M 2 B|N |⇥|F |

            Set of seed vertices N
                                                   3. Mining interesting
                                                  paths and path patterns


Figure 2: Main steps to mine a set F of features (i.e., neighbors, paths, and path patterns) associated with a set N of
seed vertices from a knowledge graph K. Step 4 is optional and depends on the application domain.

Table 1: Parameters that configure the mining of interesting neighbors, paths, and path patterns in a knowledge graph K.
Each parameter is associated with a domain and is used in specific steps (see Figure 2 for step numbers). Parameter m
is specific to the considered application. Here, we illustrate the role of m with the biomedical domain.


    Parameter     Domain             Steps   Description
    k             N +
                                     2, 3    Maximum length of paths and path patterns
    t             N                   3      Maximum level for generalization in class hierarchies
    d             N                  2, 3    Maximum degree (u = true) or out degree (u = false) to allow expansion
    lmin          N                  2, 3    Minimum support for features
    lmax          N                  2, 3    Maximum support for features
    u             B                  2, 3    Whether only out arcs (u = false) or all arcs (u = true) are traversed
    bpredicates   List of URIs       2, 3    Blacklist of predicates not to traverse
    bexp-types    List of URIs       2, 3    Blacklist of classes whose instances are not to reach
    bgen-types    List of URIs       2, 3    Blacklist of classes not to use in generalization
    m             {none, p, g,        4      Optional and domain-dependent filtering strategy
                  m, pg, pgm}                Illustrated here with the biomedical domain


3       Towards a scalable approach to mine interesting paths and path patterns
In this paper, we consider a knowledge graph K and a set of seed vertices N = {n1 , n2 , . . . , np } ✓ V . The task is to
mine neighbors, paths, and path patterns from K that are associated with these seed vertices. For example, given a set
of drugs that cause or not a side effect, we aim to mine features that can later be used to classify these drugs.
In the following subsections, we propose algorithms to build a binary matrix M of size |N | ⇥ |F| from the knowledge
graph K and the set of seed vertices N . The set F consists of interesting neighbors, paths, and path patterns mined from
K, i.e., neighbors, paths, and path patterns that satisfy the constraints defined in terms of the parameters summarized in
Table 1. These parameters will be detailed in the following subsections. M associates a seed vertex n 2 N with its
features f 2 F, i.e., if Mn,f = true, then n has feature f . We outline our approach in Figure 2 where steps 1, 2, and
3 are mandatory, while step 4 is optional and depends on the application domain.

3.1      Canonicalizing K

The first step of our approach consists in canonicalizing the knowledge graph K, i.e., unifying vertices that represent
the same real-world entity. We use the canonicalization word by analogy with the canonicalization of knowledge
bases, which consists in unifying equivalent individuals into one [4]. Indeed, in knowledge bases under the Open
Information Extraction paradigm, facts and entities can be represented by synonymous terms, which leads to co-existing
and equivalent individuals. For example, in such knowledge bases, two individuals Obama and Barack Obama can
co-exist. Similarly, in K, vertices can be connected through arcs labeled by the owl:sameAs predicate, indicating that
these vertices are actually representing the same real-world entity.
Such a situation typically arises when K comprises several data sets. For example, a drug can be represented by two
vertices linked by an owl:sameAs arc, resulting from the information extraction of two independent drug-related
databases. Therefore, their merging allows an easy access to the full extent of the knowledge in K about the drug they
represent. Such a canonicalization process corresponds to edge contraction in graph theory (i.e., taking graph quotient).
In our framework, it reduces to contracting arcs whose label is the owl:sameAs predicate.
To perform this canonicalization, we must respect the semantics associated with the owl:sameAs predicate, and thus
take into account its symmetry and transitivity. Indeed, an owl:sameAs arc between two vertices either is explicitly
stated in K or follows from existing arcs and these two properties. Let us consider a vertex v in K. The canonicalization
step merges v with all its identical vertices based on owl:sameAs arcs. To compute this set of vertices, it suffices
to compute the connected component of v in the undirected spanning subgraph formed by the owl:sameAs arcs of
K. Indeed, undirected edges comply with the symmetry of owl:sameAs and connected components comply with the
transitivity of owl:sameAs.
As a result, this step takes K as input and outputs its canonical graph KC = ⌃C
                                                                              V , ⌃A , V , A , s , t , ` V , ` A .
                                                                                   C    C   C C C C C

Similarly to K, K is a directed labeled multigraph where
                  C


            • V C is the set of canonical vertices.
            • AC is the set of canonical arcs connecting canonical vertices through predicates.
            • ⌃C
               V is the set of canonical vertex labels.

            • ⌃C
               A is the set of canonical arc labels.

            • sC : AC ! V C (respectively tC : AC ! V C ) associates a canonical arc to its canonical source (respectively
              target) vertex.
            • `C
               V : V
                     C
                          V (respectively `A : A ! ⌃A ) maps a canonical vertex (respectively a canonical arc) to its
                       ! ⌃C                C    C   C

              label.

Each canonical vertex in KC represents a vertex from K and all its identical vertices. It is possible for a canonical vertex
in KC to only represent one vertex v from K if v has no identical vertices. This corresponds to creating a surjective
mapping : V ! V C associating a vertex from K to its equivalent canonical vertex in KC . Canonical arcs in KC are
constructed by using to map the source and target vertices of arcs in K to canonical vertices. Similarly, the set of seed
vertices N is mapped to the set of canonical seed vertices, denoted by N C .
Remark 1. Note that storing URIs has a high memory footprint. Thus, in KC , we use indices in N instead of URIs to label
vertices and arcs, i.e., ⌃C
                          V ✓ N and ⌃A ✓ N. This leads to a reduced memory consumption in subsequent algorithms.
                                      C

Each canonical vertex has one unique label, differing from labels of other canonical vertices, i.e., |⌃C   V | = |V |. This
                                                                                                                   C

“relabeling” is inspired by the work of de Vries and de Rooij [13] that use a structure named pathM ap to represent a
path by an integer. We developed our own structure for this relabeling, which we named CacheManager.

3.2        Mining interesting neighbors and types

3.2.1       Mining interesting neighbors
Here, we select all vertices that are neighbors of at least one seed vertex in N C by performing a breadth-first search
constrained by parameters k, d, u, bpredicates , and bexp-types . Neighbors are selected by traversing at most k arcs from the
seed vertices in N C . If u = false, then only outgoing arcs are traversed; otherwise, all arcs are traversed regardless of
their orientation.
However, not all neighboring vertices are of interest. For example, we want to avoid provenance metadata vertices.
Indeed, they may not constitute discriminative features as they are specific to the vertex they describe. As we aim to
use ontology classes to generate path patterns, we also need to keep the graph exploration over the individuals of K
and avoid traversing rdf:type arcs. To this aim, we do not traverse arcs that are labeled by a predicate whose URI or
prefix of URI is blacklisted in bpredicates . For example, we blacklist in bpredicates the prefix of the provenance ontology
PROV-O5 and the URI of the rdf:type predicate6 .
Additionally, we provide a blacklist bexp-types of URIs or prefixes of classes whose instances must not be reached.
Hence, we do not reach individuals that instantiate directly or indirectly a blacklisted class, by following rdf:type
and rdfs:subClassOf arcs. For example, in a use case of classifying drugs that cause or not a side effect, one may
want to avoid neighbors that represent the side effect. That is why, the ontology class representing the side effect is
blacklisted in bexp-types .
      5
          http://www.w3.org/ns/prov#
      6
          http://www.w3.org/1999/02/22-rdf-syntax-ns#type
When mining neighboring vertices, we may encounter vertices with a high degree, hereafter named hubs. If the
graph exploration considered their numerous neighbors, then the size of the selected neighborhood would increase
exponentially, thus causing a scalability issue. Additionally, hub neighbors may not constitute specific and discriminative
features. Indeed, if a hub can be reached from some seed vertices, i.e., appears in their neighborhood, the neighbors
of the hub will be reached by the same seed vertices. That is why, in our approach, we propose to stop the graph
exploration at vertices whose degree is strictly greater than parameter d7 .
Remark 2. If u = false, then the degree of a vertex only counts outgoing arcs, otherwise all arcs are counted. The
degree does not count arcs whose predicate is blacklisted in bpredicates . The degree counts arcs incident to a vertex that
instantiates a blacklisted class in bexp-types .

As a result, with k, d, u, bpredicates , and bexp-types fixed, we obtain a set of neighboring vertices, denoted by N (N C ) ✓ V C .
Each neighboring vertex v 2 N (N C ) may only appear in the neighborhood of some seed vertices from N C w.r.t. the
parameters. Thus, v is associated with these seed vertices, which we indicate by defining the support set of v.
Definition 1 (Support set of a neighbor). For a given choice of these parameters, the support set of a neighboring
vertex v 2 N (N C ) is denoted by S UPPORT S ET(v) ✓ N C and defined as the set of seed vertices from N C having v as
neighbor. The support of a neighbor is defined as the cardinal of its support set.

Note that some vertices in N (N C ) are not very discriminative: when they are associated with very few vertices from
N C or nearly all of them. This motivates the use of parameters lmin and lmax that define the minimum and maximum
support for a neighbor to appear in the set F of features. Hence, a neighbor v 2 N (N C ) constitutes a feature in the
output matrix M if and only if lmin  |S UPPORT S ET(v)|  lmax . We denote the set of interesting neighbors to appear
in F by
                        Nl (N C ) = v | v 2 N (N C ) and lmin  |S UPPORT S ET(v)|  lmax .
In M, for nC 2 N C and v 2 Nl (N C ), we have MnC ,v = true if and only if nC 2 S UPPORT S ET(v).
Example 1. From KC in Figure 1 and with k = 3, d = 4, lmin = 2, lmax = 3, u = false, bpredicates =
{type, subClassOf}, and bexp-types = ;, we obtain:

                                    N (N C ) = {v1 , v2 , v3 , v4 , v5 , v6 , v8 , v9 } ;
                         S UPPORT S ET(v1 ) = S UPPORT S ET(v6 ) = nC    C
                                                                    1 , n2 ;
                         S UPPORT S ET(v2 ) = S UPPORT S ET(v3 ) = S UPPORT S ET(v8 ) = nC
                                                                                         1 ;
                         S UPPORT S ET(v4 ) = S UPPORT S ET(v5 ) = S UPPORT S ET(v9 ) = nC
                                                                                         2 ;
                                   Nl (N C ) = {v1 , v6 } .
The graph exploration stops at v1 . Indeed, it is considered a hub as its degree is greater than d. Thus, vertices on the left
of Figure 1 are not explored. Because u = false, the graph exploration cannot reach v7 . If bexp-types = {T3 }, then v3
and v5 cannot be traversed, resulting in N (N C ) = {v1 , v2 , v4 , v8 , v9 }.

3.2.2   Mining interesting types
Observe that we can use N (N C ) to compute interesting types over the considered neighborhood. These interesting
types will alleviate a scalability issue arising when building path patterns in Subsection 3.3. Interesting types must be
computed over N (N C ) and not Nl (N C ) as vertices whose support is below lmin can instantiate interesting types. As
an intuitive example, in Figure 1, T3 is associated with both nC1 (because of v3 ) and n2 (because of v5 ). For lmin = 2,
                                                                                         C

v3 and v5 will not be selected as features, however T3 can be used in path patterns.
Parameters t and bgen-types constrain the ontology classes considered in the construction of path patterns, and thus they
are integrated in the computation of interesting types. Parameter t specifies the maximum level of considered classes
in ontology hierarchies. This level is computed by starting at vertices to generalize and following rdf:type and
rdfs:subClassOf arcs. t = 0 only allows to generalize vertices with >, which is considered to be instantiated by
all vertices. For example, v3 can be generalized by T2 and > if t = 1, and by T2 , T3 , and > if t = 2. Additionally,
types used for generalization must not be blacklisted in bgen-types . This blacklist consists of URIs or prefixes of ontology
classes not to be used during the construction of path patterns. For example, we refrain from considering general classes
such as pgxo:Drug8 .
   7
     From a similar assessment, de Vries and de Rooij [14] tackle the hub issue by removing edges based on frequency of pairs
(source, predicate) and (predicate, target).
   8
     http://pgxo.loria.fr/Drug
To compute interesting types, we must first compute their support set. This motivates the following predicate:
                                            ⇢
                                              true if v instantiates T under parameters t and bgen-types
               inst(v, T, t, bgen-types ) =
                                              false otherwise
We can then define the support set of an ontology class T as follows:
Definition 2 (Support set of an ontology class). The support set of an ontology class T is the union of support sets of
vertices v 2 N (N C ) that can be generalized by T under parameters t and bgen-types . Formally:
                                                           [
                                S UPPORT S ET(T ) =                   S UPPORT S ET(v)
                                                            v2N (N C )
                                                        inst(v,T,t,bgen-types )


Finally, the set of interesting types used to build path patterns is defined as T lmin = {T | lmin  |S UPPORT S ET(T )|}.
Example 2. With the same parameters as in Example 1, we obtain
      S UPPORT S ET(T2 ) = nC
                            1 ;             S UPPORT S ET(T1 ) = nC    C
                                                                  1 , n2 ;             T lmin = {T1 , T3 , T5 , T6 , >} .

3.3   Mining interesting paths and path patterns

This step focuses on mining interesting paths and path patterns rooted by seed vertices nC 2 N C . We use the term path
feature (PF) to indicate a path or a path pattern.
                                                                                                 p
Definition 3 (Path feature). A path feature is a sequence of atomic elements that are pairs ! E where p is a predicate
and E is either (i) an individual (for paths), or (ii) an individual or an ontology class (for path patterns). The length of a
path feature counts the number of its atomic elements.
                                      p1       p2      p3
Example 3. In Figure 1, the path ! v2 ! v3 ! v6 can be rooted by nC
                                                                  1 and is of length 3. The path pattern
p1    p2
 ! T1 ! T3 can be rooted by nC 1 and n C
                                       2 and is of length 2.

Interesting path features are built by a breadth-first expansion starting at vertices in N C . As previously, k defines the
maximum number of arcs traversed. Hence, path features are of length 1 to k. Observe that a scalability issue arises
when mining interesting path features. Indeed, there may be several paths between two vertices and each vertex in a
                                                                                  p1     p2      p3
path can be generalized by several ontology classes. For instance, for t = 2, ! v2 ! v3 ! v6 can be generalized
by up to 23 path patterns. We propose a mining procedure that alleviates the scalability issues associated with the
mining of path patterns. This mining procedure relies on the monotonicity of the support set of path features, which is
defined as follows:
Definition 4 (Support set of a path feature). The support set of a path consists of all vertices from N C that root it in
KC . Formally,
                                pa         pb
                                                       n                 pa          pb
                                                                                                        o
                S UPPORT S ET( ! va . . . ! vb ) = nC 2 N C | nC ! va . . . ! vb exists in KC .

The support set of a path pattern consists of all vertices from N C that root a path in KC that is captured by the path
pattern. Formally,
                   pa         pb
                                         n                   pa        pb
  S UPPORT S ET( ! Ea . . . ! Eb ) = nC 2 N C | nC ! va . . . ! vb exists in KC and
                                                                                                                          o
                                                                           8vi , vi = Ei or inst(vi , Ei , t, bgen-types ) .

The support of a path feature is defined as the cardinal of its support set.
                                                         p1        p2         p3
Example 4. From Figure 1, we have S UPPORT S ET( ! v2 ! v3 ! v6 ) = nC
                                                                     1 .

Our approach of mining interesting path features is guided by the dependency structure as illustrated in Figure 3. At
first, this structure is empty and it is then augmented at each iteration of Algorithm 1, whose operations are described
and illustrated below.
Remark 3. As introduced in Remark 1, there are some hacks to help mitigating some of the scalability drawbacks.
Storing and manipulating a path feature as a list of elements has a high memory footprint. Thus, such a list is only
stored once in our CacheManager structure and the returned index (from N) is used in mining algorithms.
                    h=1                                                            h=2                                                      h=3
                    p6                                                        p1   2p                                                  p1   p
                                                                                                                                            2    3   p
                    ! v8                                                      ! v2 ! v3                                                ! v2 ! v3 ! v6
                     nC
                      1                                                            nC
                                                                                    1                                                           nC
                                                                                                                                                 1


     p6             p6                p6
      ! T5          ! T6              !>
                                           C        p1   2 p                  p1   2 p                p1    p2                         p1   p
                                                                                                                                            2    3   p
          C              C                                 T3                                                  v3
    nC
     1 , n2        nC
                    1 , n2           nC
                                      1 , n2        ! v2 !                    ! v2 ! >                  ! T1 !                         ! v2 ! v3 ! >
                                                         nC
                                                          1                        nC
                                                                                    1                      nC
                                                                                                            1                                   nC
                                                                                                                                                 1


                    p6
                    ! v9
                     nC
                      2
                                                               p1   2   p                p1   2   p                 p1   p
                                                                                                                         2    3    p                     p1   2p   3    p
                                                               ! T1 ! T3                 ! T1 ! >                   ! T1 ! T3 ! v6                       ! T1 ! T3 ! >
                                                                          C                         C                          C                                    C
                                                                    nC
                                                                     1 , n2                   nC
                                                                                               1 , n2                    nC
                                                                                                                          1 , n2                              nC
                                                                                                                                                               1 , n2
     p4                      p1
      ! v1                    ! v2
          C
    nC
     1 , n2                   nC
                               1


                                                    p1   2 p                  p1   2 p                p1    p2                         p1   p
                                                                                                                                            2    3   p
                                                    ! v4 ! T3                 ! v4 ! >                  ! T1 ! v5                      ! v4 ! v5 ! >
     p4             p1                p1
      !>            ! T1              !>                 nC
                                                          2                        nC
                                                                                    2                      nC
                                                                                                            2                                   nC
                                                                                                                                                 2
          C              C                 C
    nC
     1 , n2        nC
                    1 , n2           nC
                                      1 , n2


                             p1                                               p1   2p                                                  p1   p
                                                                                                                                            2    3   p
                             ! v4                                             ! v4 ! v5                                                ! v4 ! v5 ! v6
                              nC
                               2                                                   nC
                                                                                    2                                                           nC
                                                                                                                                                 2


Figure 3: Dependency structure used when mining interesting path features from the canonical graph in Figure 1. Parameters are k = 3, t = 2, d = 4, u = false,
bpredicates = {type, subClassOf}, bexp-types = ;, bgen-types = ;, lmin = 2, and lmax = 3. Path features are displayed with their support set. Solid arrows represent
expansion, dashed arrows represent generalization, rectangles represent paths, and rounded rectangles represent path patterns. Hatched rectangles represent discarded
path features because of support limits or more specific path features with identical support set. Green rectangles represent path features ultimately added to F. Blank
rectangles represent path features that respect specificity and support constraints but are not in F because of other features (in green). For readability purposes, path
features are displayed as the list of their elements instead of indices from N actually used to save memory.
Algorithm 1 Mining interesting paths and path patterns
Input: The canonical knowledge graph KC , the set of canonical seed vertices N C , the set of interesting types T lmin
Parameters: k, t, d, lmin , lmax , u, bpredicates , bexp-types , bgen-types
Output: F and M completed with interesting paths and path patterns
 1: h     1
 2: repeat
 3:   Expand paths in Ph
 4:   Generalize expanded paths into path patterns
 5:   Keep most specific path features
 6:   Select (i) generated path features to add to F (complete M accordingly), and (ii) paths to add to Ph+1
 7:   h     h+1
 8: until h > k or Ph = ;


3.3.1   Expand paths in Ph
                                                 pe
Each path P 2 Ph 9 is expanded with pairs ! ve that are chosen in the neighborhood of the last individual of P . This
choice is constrained by parameters k, d, u, bpredicates , and bexp-types as in Subsection 3.210 . In the first iteration, the
neighborhood of seed vertices in N C is used. If no path in Ph can be expanded, i.e., the neighborhood of their last
vertex does not contain reachable vertices under the constraints, then Algorithm 1 ends.
                                                                                                           p4        p1       p1
Example 5. From the graph in Figure 1, the first expansion generates the following paths: ! v1 , ! v2 , ! v4 ,
 p6          p6                             p4
 ! v8 , and ! v9 . In the second expansion, ! v1 is not expanded as v1 is a hub under d = 4. Since their respective
                                                  p6        p6                                               p1
neighborhood does not contain reachable vertices, ! v8 and ! v9 are also not expanded. The expansion of ! v2
          p1    p2                                p1             p1      p2
generates ! v2 ! v3 whereas the expansion of ! v4 generates ! v4 ! v5 .

3.3.2   Generalize expanded paths into path patterns
                                                                                     pe
Let Pe be the expansion of P 2 Ph as previously described, i.e., Pe = P ! ve . We generalize Pe by:
                                     pe
        • Generating patterns P ! T with types T 2 T lmin for which the predicate inst(ve , T, t, bgen-types ) is verified.
        • Retrieving from the dependency structure the path patterns that generalize P 11 and expanding them with
          pe          pe
           ! ve , and ! T for all T 2 T lmin for which the predicate inst(ve , T, t, bgen-types ) is verified.

Intuitively, this generalization operation allows to expand path patterns.
                                       p1                          p1           p1                         p1
Example 6. In the first iteration, ! v2 is generalized by ! T1 and ! >. As we will see, ! > is not kept in the
                                                                                                                   p1
dependency structure at the end of the first iteration (see Subsections 3.3.3 and 3.3.4). In the second iteration, ! v2
             p1      p2                                                  p1                                       p1
expands into ! v2 ! v3 . In the dependency structure, we retrieve ! T1 as the path pattern generalizing ! v2 ,
                        p1      p2      p1       p2           p1     p2
which is expanded into ! T1 ! v3 , ! T1 ! T3 , and ! T1 ! >.

We only generalize paths with types T 2 T lmin to avoid the generation an important number of uninteresting path
patterns that would then be discarded, thus reducing the memory footprint. Indeed, by definition, if T 62 T lmin , then
|S UPPORT S ET(T )| < lmin . Additionally, given a path feature P , |S UPPORT S ET(P )|  minE2P |S UPPORT S ET(E)|,
where E can be a class or an individual involved in P . Thus, if T 62 T lmin is used in a path pattern P , we would have
|S UPPORT S ET(P )| < lmin , therefore generating an uninteresting path pattern that would be discarded later.

3.3.3   Keep most specific path features
Inspired by works that prune redundant generalized rules during their generation [21], we keep only the “most specific”
path patterns among those that have the same support set.
   9
      Ph is explained in Subsection 3.3.4.
  10
      Additionally, to avoid loops, P can only be expanded at iteration h with individuals ve such that there exists at least one seed
vertex in S UPPORT S ET(P ) whose shortest distance to ve is h.
   11
      Not all generated path patterns remain in the dependency structure at the end of an iteration, see Subsections 3.3.3 and 3.3.4.
Definition 5 (More specific path pattern). A path pattern P1 is more specific than another path pattern P2 if every
                                                                                                            p1
atomic element of P1 is more specific than the atomic element of P2 at the same position. An atomic element ! E1 is
                                            p2
more specific than another atomic element ! E2 if and only if:

        (i) p1 = p2 , i.e., both atomic elements involve the same predicate12 , and
       (ii) E1 is more specific than E2 13 .

When path patterns have the same support set, keeping the most specific ones remove redundant generalizations, thus
reducing their number and the computational burden. Additionally, we ensure a high descriptive power because the
most specific paths are the most descriptive. Intuitively, a path pattern involving a class T is less descriptive than
another pattern involving a subclass or an instance of the class. However, keeping the most specific path patterns is
computationally expensive, which led us to propose the following computational procedure.
We notice that the support set of a path pattern is the union of the support sets of the paths it generalizes. Therefore,
we discard path patterns that generalize only one path. Indeed, such path patterns have the same support set as their
original path and are more general, by definition.
                                                                              p1       p2       p3           p1      p2       p3
Example 7. For h = 3, we discard the following path patterns: ! v2 ! v3 ! > and ! v4 ! v5 ! >.

However, there may exist path patterns that generalize several more specific path features while having the same support
set. Such path patterns should also be discarded.
                               p1                                                                              p1
Example 8. For h = 1, ! > shares the same support set as the more specific path pattern ! T1 and thus should be
discarded.

To efficiently discard path patterns, we avoid computing their whole hierarchy. Instead, we focus on retaining only the
most specific ones in the prefix tree depicted in Figure 4. This prefix tree is incrementally augmented and stores the most
specific path patterns for a specific iteration and support set. In this tree, individuals/classes and predicates involved in
path patterns are indexed separately. Thus, its depth is twice the length of path patterns of the current iteration.
The prefix tree enables an efficient storage and selection of the most specific path patterns. Indeed, let P be a path
pattern to be compared with those already stored. A breadth-first traversal is performed to detect more specific patterns
than P : we only traverse identical or more specific elements according to Definition 5. At any depth, if such elements
cannot be found, then P is one of the most specific patterns and the traversal stops. On the contrary, if the traversal
reaches a leaf containing more specific elements, then there is a pattern more specific than P with the same support set.
Consequently, P is discarded and removed from the dependency structure.
If P is to be stored, another breadth-first traversal is performed by considering identical or more general elements than
the ones in P . When the traversal reaches a leaf, it means that more general patterns than P are currently stored and
have the same support set. These are removed from the prefix tree before storing P . They are also removed from the
dependency structure.
Remark 4. As the prefix tree relies on associative arrays and sets, the computational cost of traversal, insertion, and
removal is reduced. Additionally, resetting the tree at each expansion and each support set reduces the number of
patterns to traverse and thus the global computational cost.

3.3.4      Select generated path features to add to F, and paths to add to Ph+1
In this subsection, we determine (i) which paths and path patterns should be added as features in F, and consequently,
(ii) which paths to add to Ph+1 , i.e., to expand during the next iteration.
A path feature P can be added in F if:

   (C1) lmin  S UPPORT S ET(P )  lmax ,
   (C2) Prefixes of P with the same support set do not already exist in F,
   (C3) If P is a path pattern, it does not generalize a path with the same support set.

When P is in F, the output binary matrix M verifies MnC ,P = true for all nC 2 S UPPORT S ET(P ).
  12
       It is noteworthy that we do not consider the hierarchy of predicates in this work.
  13
       A class is more specific than all its super-classes and an individual is more specific than all classes it instantiates.
                    h=1                                                       h=2
             Support set nC    C
                          1 , n2
                                                                       Support set nC    C
                                                                                    1 , n2


                                  {>}
            p1                                    p1                  T1                  p2                {T3 }

            p6
                                {T5 , T6 }


Figure 4: Prefix tree used when computing most specific path patterns during the first and second iterations considering
the support set nC  1 , n2 . This structure is reset at each expansion and for each support set. For h = 1, when
                         C
             p1                                p1
considering ! T1 , the structure will evolve: ! > will be removed and replaced by the considered path. For h = 3,
                    p1     p2                                           p1      p2
when considered, ! T1 ! > will be discarded as more general than ! T1 ! T3 .


Remark 5. Note that (C2) allows to focus on shorter paths in F. However, (C2) is not applied if P ends with an
individual and there exist a prefix of P in F that ends with a class. P is considered more descriptive than its prefix,
because of the individual in the last position. Hence, we add P in F and remove its prefix. For example, in Figure 3,
 p1     p2                     p1      p2      p3
 ! T1 ! T3 is replaced by ! T1 ! T3 ! v6 in F for h = 3.
Remark 6. (C3) is motivated as the path is more specific and thus more descriptive than P and should be added instead
of P .

We also select the paths and path patterns to expand during the next iteration. To reduce their number, we rely on the lmin
constraint and the monotonicity of the support set. It is clear that, for a path feature P , we have |S UPPORT S ET(P )| 
minE2P |S UPPORT S ET(E)|, where E can be a class or an individual involved in P . Thus, when expanding a path
feature, its support set remains identical (for paths and path patterns) or decreases (for path patterns).
Consequently, we add to Ph+1 paths whose expansion may generate path features complying with the lmin constraint,
i.e., paths with a support greater than lmin or paths that are generalized by a pattern with a support greater than lmin .
This monotonicity property also lets us remove from the dependency structure patterns whose support is lower than
lmin . Indeed, such patterns cannot be used during the generalization step of the next iteration since they will inevitably
generate a pattern whose support is smaller than lmin . As a result, the monotonicity property does entail a reduction in
the number of paths and path patterns considered in the next iteration, thus reducing the computational cost.
                                                              p1                                               p1
Example 9. For example, in the first iteration, the path ! v2 is added to P2 as it is generalized by ! T1 whose
                                                                                   p1  p2
support is greater than lmin = 2. At the end of the second iteration, we remove ! v2 ! T3 because its support is
lower than lmin = 2, and thus its expansion cannot generate an interesting pattern.


3.4   Optional and domain-dependent filtering

After the previous steps, we obtain a feature set F containing interesting neighbors, paths, and path patterns. These
features have been mined without taking into account domain constraints known to experts. We propose to apply
domain-dependent filtering on F with parameter m. Such filters reduce the size of F and integrate interestingness
constraints based on expert knowledge.
Example 10. To classify drugs causing or not a side effect, experts may want to focus on features containing a
biological pathway, a gene or a GO class, or a MeSH class. Therefore, we propose three atomic filters, only keeping
neighbors, paths, and path patterns containing at least a pathway (m = p), a gene or a GO class (m = g), or a MeSH
class (m = m). Such atomic filters can be combined to form disjunctive filters. For example, the m = pg filter keeps
features from F containing at least a pathway or a gene or a GO class. When a filter is applied to a neighbor, this
neighbor must be, e.g., a pathway for the p filter. When a filter is applied to a path or a path pattern, it means that one of
its individuals / ontology classes must be, e.g., a pathway for the p filter.

This domain-dependent filtering is similar to approaches that generalize association rules and prune those that involve
some specified ontology classes [21, 22].
4        Experimental setup
To illustrate our approach, we will address the following task: from a knowledge graph, mine a set of features to classify
drugs depending on whether they cause a specific side effect.
We explore PGxLOD14 [10], a knowledge graph that aggregates several sets of Linked Open Data (LOD) describing
drugs, phenotypes, and genetic factors: PharmGKB, ClinVar, DrugBank, SIDER, DisGeNET, and CTD. This aggre-
gation may lead to features combining units from several LOD sets. Indeed, LOD sets may contain different and
incomplete knowledge. Their combined use then enables leveraging a greater amount of knowledge, where some LOD
sets complete information provided by others. This asks for a canonical knowledge graph as described in Subsection 3.1.
For instance, it is possible to complete the knowledge related to a drug described in PharmGKB if it is linked with an
owl:sameAs arc to the same drug described in DrugBank. This constitutes the key interest in combining LOD sets in
knowledge discovery and data mining tasks, as discussed by Ristoski and Paulheim [23].
We will use the following data sets that comprise positive ( ) and negative ( ) drug examples:
Data set 1 (Drug Induced Liver Injury (DILI) [24]). It is formed by 1,036 drugs in 4 classes: “most DILI concern”
(192 drugs), “ambiguous DILI concern” (254 drugs), “less DILI concern” (278 drugs), and “no DILI concern” (312
drugs). We mapped these drugs from their PubChem identifiers to identifiers from PharmGKB, otherwise DrugBank,
otherwise KEGG, resulting in the set of seed vertices N DILI = N DILI [ N DILI such that:

          • |N DILI | = 146 drugs (118 from PharmGKB, 17 from DrugBank, and 11 from KEGG). The positive drug
            examples are from the “most DILI concern” class.
       • |N DILI | = 224 drugs (206 from PharmGKB, 9 from DrugBank, and 9 from KEGG). The negative drug
         examples are from the “no DILI concern” class.
Data set 2 (Severe Cutaneous Adverse Reactions (SCAR)15 ). It is formed by 874 drugs in 5 classes: “very probable”
(18 drugs), “probable” (19 drugs), “possible” (94 drugs), “unlikely” (697 drugs), and “very unlikely” (46 drugs). We
mapped these drugs from their PubChem identifiers to identifiers from PharmGKB, otherwise DrugBank, otherwise
KEGG, resulting in the set of seed vertices N SCAR = N SCAR [ N SCAR such that:

          • |N SCAR | = 102 drugs (100 from PharmGKB and 2 from DrugBank). The positive drug examples are from the
            “very probable”, “probable”, and “possible” classes.
          • |N SCAR | = 290 drugs (286 from PharmGKB and 4 from DrugBank). The negative drug examples are from
            the “unlikely” and “very unlikely” classes.
We implemented our approach in Python16 . We used a server with 700 GB of RAM and the following parameter values
k 2 {1, 2, 3, 4}, t 2 {1, 2, 3}, d = 500, u = false, lmin = 5, lmax = +1, and m 2 {p, g, m, pg, pgm}. It should be
noted that k = 4 was only tested with t = 1 because of memory issues caused by the high number of generated features.
Statistics about the features are detailed for k = 3, t = 3 and k = 4, t = 1 in Table 2 and discussed in the next section.
We obtained the features associated with the DILI data set under k = 3, t = 3 in approximately 1 hour. However,
computing the features with k = 4, t = 1 on the same data set required 4 days and 380 GB of RAM.

5        Results and discussion
The first two lines of Table 2 show the number of neighbors and types reachable before applying support limits.
Enforcing these limits constitutes a first reduction of these numbers, thus reducing the memory and computational
footprints. Here, numbers are approximately divided by 2. The mining of path features always relies on lmin , and
thus it is not possible to count the number of all possible paths and path patterns. However, we show the number of
path features generated during the mining, which already illustrates the combinatorial explosion. Enforcing support
constraints and removing redundant generalizations allow to reduce their number in F (here, approximately by 20).
Finally, the domain-dependent filtering defined by m also radically scales down the number of features ultimately
output. However, this filtering only happens as post-processing and does not alleviate the scalability issues arising
during the mining of patterns.
We observe a drastic increase in the number of neighbors and path features alongside k, which highlights the scalability
issues of mining large knowledge graphs. Considering additional levels in ontology hierarchies by increasing t also
    14
       https://pgxlod.loria.fr
    15
       http://www.regiscar.org/
    16
       https://github.com/pmonnin/kgpm
Table 2: Number of features mined for the two considered data sets for parameters d = 500, u = false before and
after applying lmin = 5, lmax = +1, and m = pgm. Path features are always computed considering at least lmin to
avoid combinatorial explosion. We display the number of path features generated (gen.) during the exploration and the
number of path features ultimately in F after keeping the most specific and applying limit constraints. Types are not
considered as features but are displayed to illustrate the possible combinatorial explosion in path patterns. Statistics
about the full neighborhood are given as comparison.

                                                                DILI                            SCAR
                                                     k = 3, t = 3 k = 4, t = 1        k = 3, t = 3 k = 4, t = 1
                              Neighbors                175,652          628,681         179,694          639,050
      Before lmin and lmax
                              Types                     13,580           18,940          13,677           18,526
                              Neighbors                71,560          281,657          90,690          312,029
                              Types                     7,372           12,421           8,800           13,177
       After lmin and lmax
                              Path features in F      790,605         10,169,975       1,146,585       10,729,002
                              Path features gen.     20,145,635      251,791,519      29,011,996      255,672,772
                              Neighbors                 4,069           31,804           4,214           33,361
             After m
                              Path features in F       102,674         2,291,846        147,936         2,400,642
                              Neighbors                       2,419,957                        2,419,920
       Full neighborhood
                              Types                             51,477                           51,472
           (d = 500)
                              Reached at k, t               k = 23, t = 21                   k = 23, t = 21
                              Neighbors                       5,488,531                        5,488,510
       Full neighborhood
                              Types                             53,486                           53,484
           (d = +1)
                              Reached at k, t               k = 19, t = 21                   k = 20, t = 21


multiplies the number of path features. To illustrate, with 700 GB of RAM, we could not set t to values greater than
1 for k = 4. Consequently, there is still room for improvements in terms of memory consumption, e.g., through an
efficient storing of paths and path patterns. These future improvements could enable to consider the full neighborhood
of seed vertices, which is reached for greater values of k and t and involves a far greater amount of vertices. For
example, for the DILI data set, the full neighborhood is reached for k = 23 and t = 21 (for d = 500), or k = 19 and
t = 21 (when the degree constraint is disabled). The full amount of reachable neighbors is 4 times (d = 500) or 9 times
greater (d = +1) than with k = 4.
The values of parameters depend on the objectives and domain knowledge of the analyst guiding the mining process,
especially for blacklists and support thresholds. However, metrics about the knowledge graph may provide guidance.
Indeed, statistics about node degrees can help to find a trade-off between exploration and combinatorial explosion with
parameter d. Similarly, the depth of class hierarchies influences the value of t. For example, general classes may not
be of interest to the analyst, thus reducing t. The parameter k can be set by considering the graph diameter. As it is
common in mining processes, iterations may be required to find the best configuration. Regarding m, it is for now
hard-coded and only suitable to some biomedical applications. Inspired by ontologies that allow to interactively define
mining workflows [3], we could adapt this parameter to other applications by proposing such an interactive definition.
When manually reviewing the output features, we noticed multiple path features across the aggregated LOD sets. This
is made possible by aggregating and canonicalizing multiple LOD sets in the knowledge graph. This result particularly
illustrates one of the fundamental aspects of Linked Open Data: the combination of different data sets enables to go
beyond their original purposes and coverage. However, it is clear that combining LOD sets leads to bigger knowledge
graphs, exacerbating the scalability issues.
Regarding our approach, we only canonicalize vertices, i.e., individuals and ontology classes. Nevertheless, predicates
used on arcs can also be identified as identical, leading to a canonicalization of arcs. In this context, we could benefit
from matching approaches, such as PARIS [25]. By identifying identical classes, predicates, and individuals, these
matching approaches could further improve the canonicalization and, therefore, increase the number of common
features between seed vertices from different data sets. Similarly, we could consider literals and arcs incident to literals
that were purposely discarded here. However, the canonicalization of literals raises several challenging issues due
to their heterogeneity in terms of syntactic variations, unit measures, and the precision of numerical values. Other
reasoning mechanisms and semantics associated with Semantic Web standards could be taken into account. For example,
predicates can be defined as transitive, and thus the canonical knowledge graph could also result from their transitive
closure.
Regarding the modeling of path patterns, we could generalize paths with both the hierarchy of classes and the hierarchy
of predicates. In addition to keeping the most specific patterns, we could use other metrics to further reduce their
redundancy (e.g., approaches relying on hierarchies [26, 27] and extents of ontological classes [27]). This could also
reduce the number of generated patterns, therefore improving the scalability of the mining approach. Neighbors could
be enriched with the distance between them and seed vertices, which would correspond to the generalized paths of
KGPTree [18]. We could also use other approaches than binary features (e.g., counting [13, 14], relative counting [28]).
More importantly, it remains to test our mined features within a complete classification task to measure the influence of
k, t, and the three kinds of features (neighbors, paths, and path patterns).


6   Conclusion

In this preliminary study, we addressed the scalability issues associated with the task of mining neighbors, paths, and
path patterns from a knowledge graph and a set of seed vertices. We proposed a method for tackling these issues,
which we illustrated by mining a real-world knowledge graph. Our results highlight the importance of considering the
scalability of approaches when mining features from ever-growing knowledge graphs. Our work alleviates part of the
computational cost (time and memory) of mining paths and path patterns but also reveals the need for a further reduction.
Such future research works could enable the modeling of more complex path patterns, for example, considering the
hierarchy of predicates.


References

 [1] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, José
     Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, Roberto Navigli, Axel-Cyrille Ngonga
     Ngomo, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan F. Sequeda, Steffen Staab, and Antoine
     Zimmermann. Knowledge graphs. CoRR, abs/2003.02320, 2020.
 [2] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst.,
     5(3):1–22, 2009.
 [3] Petar Ristoski and Heiko Paulheim. Semantic web in data mining and knowledge discovery: A comprehensive
     survey. J. Web Semant., 36:1–22, 2016.
 [4] Luis Galárraga, Geremy Heitz, Kevin Murphy, and Fabian M. Suchanek. Canonicalizing open knowledge bases.
     In Jianzhong Li, Xiaoyang Sean Wang, Minos N. Garofalakis, Ian Soboroff, Torsten Suel, and Min Wang,
     editors, Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge
     Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pages 1679–1688. ACM, 2014.
 [5] Heiko Paulheim. Generating possible interpretations for statistics from linked open data. In Elena Simperl,
     Philipp Cimiano, Axel Polleres, Óscar Corcho, and Valentina Presutti, editors, The Semantic Web: Research and
     Applications - 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012.
     Proceedings, volume 7295 of Lecture Notes in Computer Science, pages 560–574. Springer, 2012.
 [6] Baoxu Shi and Tim Weninger. Discriminative predicate path mining for fact checking in knowledge graphs.
     Knowl.-Based Syst., 104:123–133, 2016.
 [7] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The Semantic Web. Scientific American, 284(5):28–37, 2001.
 [8] Thomas R Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–
     220, 1993.
 [9] Charu C. Aggarwal and Haixun Wang, editors. Managing and Mining Graph Data, volume 40 of Advances in
     Database Systems. Springer, 2010.
[10] Pierre Monnin, Joël Legrand, Graziella Husson, Patrice Ringot, Andon Tchechmedjiev, Clément Jonquet, Amedeo
     Napoli, and Adrien Coulet. PGxO and PGxLOD: a reconciliation of pharmacogenomic knowledge of various
     provenances, enabling further comparison. BMC Bioinformatics, 20-S(4):139:1–139:16, 2019.
[11] Guillaume Bonfante, Bruno Guillaume, and Guy Perrier. Application of Graph Rewriting to Natural Language
     Processing. Wiley Online Library, 2018.
[12] Pablo Barceló, Leonid Libkin, and Juan L. Reutter. Querying graph patterns. In Maurizio Lenzerini and Thomas
     Schwentick, editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of
     Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 199–210. ACM, 2011.
[13] Gerben Klaas Dirk de Vries and Steven de Rooij. A fast and simple graph kernel for RDF. In Claudia d’Amato,
     Petr Berka, Vojtech Svátek, and Krzysztof Wecel, editors, Proceedings of the International Workshop on Data
     Mining on Linked Data, with Linked Data Mining Challenge collocated with the European Conference on Machine
     Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2013), Prague, Czech
     Republic, September 23, 2013., volume 1082 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.
[14] Gerben Klaas Dirk de Vries and Steven de Rooij. Substructure counting graph kernels for machine learning from
     RDF data. J. Web Semant., 35:71–84, 2015.
[15] Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. AMIE: association rule mining
     under incomplete evidence in ontological knowledge bases. In Daniel Schwabe, Virgílio A. F. Almeida, Hartmut
     Glaser, Ricardo A. Baeza-Yates, and Sue B. Moon, editors, 22nd International World Wide Web Conference,
     WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, pages 413–422. International World Wide Web Conferences
     Steering Committee / ACM, 2013.
[16] Josua Stadelmaier and Sebastian Padó. Modeling paths for explainable knowledge base completion. In Proceedings
     of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 147–157,
     2019.
[17] Heiko Paulheim and Johannes Fürnkranz. Unsupervised generation of data mining features from linked open
     data. In Dumitru Dan Burdescu, Rajendra Akerkar, and Costin Badica, editors, 2nd International Conference on
     Web Intelligence, Mining and Semantics, WIMS ’12, Craiova, Romania, June 6-8, 2012, pages 31:1–31:12. ACM,
     2012.
[18] Gilles Vandewiele, Bram Steenwinckel, Femke Ongenae, and Filip De Turck. Inducing a decision tree with
     discriminative paths to classify entities in a knowledge graph. In Zhe He, Jiang Bian, Cui Tao, and Rui Zhang,
     editors, Proceedings of the 4th International Workshop on Semantics-Powered Data Mining and Analytics co-
     located with the 18th International Semantic Web Conference (ISWC 2019), Aukland, New Zealand, October 27,
     2019., volume 2427 of CEUR Workshop Proceedings. CEUR-WS.org, 2019.
[19] Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Umeshwar Dayal, Peter
     M. D. Gray, and Shojiro Nishio, editors, VLDB’95, Proceedings of 21th International Conference on Very Large
     Data Bases, September 11-15, 1995, Zurich, Switzerland., pages 407–419. Morgan Kaufmann, 1995.
[20] Xuan Zhou and James Geller. Raising, to enhance rule mining in web marketing with the use of an ontology. In
     Data Mining with Ontologies: Implementations, Findings, and Frameworks, pages 18–36. IGI Global, 2008.
[21] Marcos Aurélio Domingues and Solange Oliveira Rezende. Using taxonomies to facilitate the analysis of the
     association rules. CoRR, abs/1112.1734, 2011.
[22] Claudia Marinica and Fabrice Guillet. Knowledge-based interactive postmining of association rules using
     ontologies. IEEE Trans. Knowl. Data Eng., 22(6):784–797, 2010.
[23] Petar Ristoski and Heiko Paulheim. Rdf2vec: RDF graph embeddings for data mining. In Paul T. Groth, Elena
     Simperl, Alasdair J. G. Gray, Marta Sabou, Markus Krötzsch, Freddy Lécué, Fabian Flöck, and Yolanda Gil,
     editors, The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October
     17-21, 2016, Proceedings, Part I, volume 9981 of Lecture Notes in Computer Science, pages 498–514, 2016.
[24] Minjun Chen, Ayako Suzuki, Shraddha Thakkar, Ke Yu, Chuchu Hu, and Weida Tong. DILIrank: the largest
     reference drug list ranked by the risk for developing drug-induced liver injury in humans. Drug Discov Today,
     21(4):648–653, 2016.
[25] Fabian M. Suchanek, Serge Abiteboul, and Pierre Senellart. PARIS: probabilistic alignment of relations, instances,
     and schema. PVLDB, 5(3):157–168, 2011.
[26] Petar Ristoski and Heiko Paulheim. Feature selection in hierarchical feature spaces. In Saso Dzeroski, Pance
     Panov, Dragi Kocev, and Ljupco Todorovski, editors, Discovery Science - 17th International Conference, DS 2014,
     Bled, Slovenia, October 8-10, 2014. Proceedings, volume 8777 of Lecture Notes in Computer Science, pages
     288–300. Springer, 2014.
[27] Claudia d’Amato, Steffen Staab, and Nicola Fanizzi. On the influence of description logics ontologies on concep-
     tual similarity. In Aldo Gangemi and Jérôme Euzenat, editors, Knowledge Engineering: Practice and Patterns,
     16th International Conference, EKAW 2008, Acitrezza, Italy, September 29 - October 2, 2008. Proceedings,
     volume 5268 of Lecture Notes in Computer Science, pages 48–63. Springer, 2008.
[28] Petar Ristoski and Heiko Paulheim. A comparison of propositionalization strategies for creating features from
     linked open data. In Ilaria Tiddi, Mathieu d’Aquin, and Nicolas Jay, editors, Proceedings of the 1st Workshop on
     Linked Data for Knowledge Discovery co-located with European Conference on Machine Learning and Principles
     and Practice of Knowledge Discovery in Databases (ECML PKDD 2014), Nancy, France, September 19th, 2014.,
     volume 1232 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

</pre>