<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TACKLING SCALABILITY ISSUES IN MINING PATH PATTERNS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Monnin</string-name>
          <email>pierre.monnin@loria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Couceiro</string-name>
          <email>miguel.couceiro@loria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amedeo Napoli</string-name>
          <email>amedeo.napoli@loria.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Bresso</string-name>
          <email>emmanuel.bresso@loria.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malika Smaïl-Tabbone</string-name>
          <email>malika.smail@loria.fr</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrien Coulet</string-name>
          <email>adrien.coulet@loria.fr</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Features mined from knowledge graphs are widely used within multiple knowledge discovery tasks such as classification or fact-checking. Here, we consider a given set of vertices, called seed vertices, and focus on mining their associated neighboring vertices, paths, and, more generally, path patterns that involve classes of ontologies linked with knowledge graphs. Due to the combinatorial nature and the increasing size of real-world knowledge graphs, the task of mining these patterns immediately entails scalability issues. In this paper, we address these issues by proposing a pattern mining approach that relies on a set of constraints (e.g., support or degree thresholds) and the monotonicity property. As our motivation comes from the mining of real-world knowledge graphs, we illustrate our approach with PGxLOD, a biomedical knowledge graph.</p>
      </abstract>
      <kwd-group>
        <kwd>Path</kwd>
        <kwd>Path Pattern</kwd>
        <kwd>Ontology</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Scalability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Knowledge graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have a central role in knowledge discovery tasks. For example, Linked Open Data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have
been used in all steps of the knowledge discovery process [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In particular, features mined from knowledge graphs
have been used in multiple applications such as knowledge base completion [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], explanations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], or fact-checking [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Here, we focus on knowledge graphs expressed using Semantic Web standards [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this context, vertices are either
individuals that represent entities of a world (e.g., places, drugs, etc.), literals (e.g., integers, dates, etc.), or classes
of individuals (e.g., Person, Drug, etc.). Arcs are defined by triples hsubject, predicate, objecti in the Resource
Description Format language. Such a triple states that the subject is linked to the object by a relationship qualified
by the predicate (e.g., has-side-effect, has-name, etc.). Classes and predicates are defined in ontologies, i.e.,
formal representations of a domain [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and organized into two hierarchies ordered by the subsumption relation. In
Semantic Web standards, individuals, classes, and predicates are identified by a Uniform Resource Identifier (URI). We
view such a knowledge graph as a directed labeled multigraph K = (⌃ V , ⌃ A, V, A, s, t, `V , `A), where
• A is the set of arcs connecting vertices through predicates2.
• ⌃ V is the set of vertex labels, here, their URI3.
• ⌃ A is the set of arc labels, here, URIs of predicates of K.
• s : A ! V (respectively t : A !
• `V : V ! ⌃ V (respectively `A : A ! ⌃ A) maps a vertex (respectively an arc) to its label.
      </p>
      <sec id="sec-1-1">
        <title>V ) associates an arc to its source (respectively target) vertex.</title>
        <p>Hence, a triple hs, p, oi is represented by two vertices vs, vo 2V and an arc ahs,p,oi 2A. The source and target vertices
of ahs,p,oi are respectively vs and vo, i.e., s(ahs,p,oi) = vs and t(ahs,p,oi) = vo. The labels of vs, vo, and ahs,p,oi are
respectively s, o, and p, i.e., `V (vs) = s, `V (vo) = o, and `A(ahs,p,oi) = p.</p>
        <p>In this work, we consider the task of mining features from K that are associated with a set of vertices of interest, which
we call seed vertices. The set of seed vertices can be defined in intension (i.e., all vertices that instantiate a specified
ontology class) or in extension (i.e., by specifying the list of their URIs). For example, in the biomedical domain, an
expert may be interested in mining features associated with vertices that represent drugs causing a specific side effect.
We propose to mine from K the three following kinds of features: neighboring vertices, paths, and path patterns.
Neighboring vertices are vertices that can be reached in K from at least one seed vertex. A neighbor is associated with
all seed vertices from which it is reachable. Its support counts such seed vertices. For example, in the knowledge graph
depicted in Figure 1, the neighbor v6 is reachable from the seed vertices n1C and n2C , and thus its support is 2.
Paths are sequences of pair!s p e that represent an arc labeled by the predicate p incident to an individual e. A path is
associated with all seed vertices that root it in K. The support of a path counts such seed vertices. For example, the
support of! p1 v2! p2 v3 is 1 since only n1C root it, i.e., n1C! p1 v2! p2 v3 exists in K.
!cpa1ptuv2re!dp2byvt3hiespcaatpteturnre!dp1by! p1
More generally, paths may share several characteristics. For instance, intermediate vertices in paths may instantiate the
same ontology classes. We propose to capture these characteristics by considering path patterns in addition to paths.
Path patterns are sequences of pair!s p E, where p is a predicate and E is either an individual or a class. Such a pair
indicates that an arc labeled by p is incident to (i) E if E is an individual or (ii) an individual that instantiates E if E
is a class. A path pattern is associated with all seed vertices that root a path captured by the path pattern. Its support
counts such seed vertices. In the example graph depicted in Figure 1, v2 instantiates T1, and v3 instantiates T2. Thus,</p>
        <p>T1! p2 v3,! p1 v2! p2 T2, and! p1 T1! p2 T2. Since T2 is a subclass of T3, it is also</p>
        <p>T1! p2 T3. Note that! p1 T1! p2 T3 also captures! p1 v4! p2 v5, which is rooted by n2C .</p>
        <p>Consequently, the support of! p1 T1! p2 T3 is 2. This illustrates the fact that path patterns may capture additional
common characteristics of seed vertices, and thus interestingly complete paths. Mining these patterns constitutes a
challenging task due to the combinatorial nature and the size of real-world knowledge graphs, which naturally entail
scalability issues. For example,! p1 v2! p2 v3 can be generalized by up to 11 path patterns.</p>
        <p>
          This mining task and its inherent scalability issues constitute the main concerns of the present work. To the best of
our knowledge, works available in the literature do not address such issues in knowledge graphs with the adopted
granular modeling of path patterns. However, inspired by existing graph mining works [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we propose an Apriori-based
approach that alleviates these scalability issues by relying on (i) a set of constraints (e.g., support or degree thresholds),
(ii) the hierarchy of ontology classes, (iii) an incremental expansion of paths and patterns, and (iv) the monotonic
character of the support of paths and patterns. We provide a reusable implementation on GitHub4.
The remainder of this paper is organized as follows. In Section 2, we outline related works that motivated our proposed
approach. In Section 3, we present in details our approach to mine paths and path patterns, and discuss how it tackles
scalability issues. We illustrate our framework in Section 4 on PGxLOD, a real-world biomedical knowledge graph [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
We comment on our results as well as indicate directions of future work in Sections 5 and 6.
2
        </p>
        <p>
          Related work
Path patterns have been widely studied in different settings, for example graph rewriting [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and query answering [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Here, we recall some works that tackle the problem of mining path patterns from knowledge graphs in different
application contexts.
        </p>
        <p>2Here, we discard literals from V and arcs that are incident to literals from A.
3Hence, |⌃ V | = |V |.
4https://github.com/pmonnin/kgpm
. . .
. . .
. . .
. . .
. . .</p>
        <p>. . .</p>
        <p>v1
v7</p>
        <p>p4
type
type
p4
p5</p>
        <p>T5
T6
p6
p6</p>
        <p>C
n2</p>
        <p>v8
type</p>
        <p>T1
type
v9
type
subClassOf
subClassOf
subClassOf
type</p>
        <p>&gt;
p1
v4
p2</p>
        <p>type
subClassOf
subClassOf
subClassOf
type
T2
T4
v5
p3
p3
T3
v6</p>
        <p>
          One major work dealing with feature mining from RDF graphs focuses on graph kernels that count common substructures
(i.e., walks, subtrees) [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ]. To avoid an explosion in the number of features, the authors remove patterns with a
low or high frequency [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Graph features can be used in various tasks such as knowledge base completion. For
example, AMIE [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] mines Horn clauses, i.e., conjunction of triples, to predict another triple. Similarly, in Context
Path Model [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], the authors model paths as sequences of predicates between the source and target entities, i.e.,
s!p1 !p·2· · ! pk 1 !pk t. Here, a triple is predicted based on the paths existing between the involved entities. Shi and
Weninger [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] also model paths as sequences of predicates but they use them from a fact checking perspective. They check
whether a triple s! p t is true by predicting it from a set of learned discriminative paths os!p1 !p·2· · ! pk 1 !pk ot,
where os and ot are respectively the set of classes instantiated by s and t. Our framework differs from the previous two
as we aim at mining features by exploring the graph from the given seed vertices and we model path patterns using the
intermediate entities and the ontology classes that they instantiate.
        </p>
        <p>
          Our motivation also comes from explainable approaches that rely on the descriptive power of features mined from
knowledge graphs. For example, Explain-a-LOD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] enriches statistical data sets with features from DBpedia. When
correlations can be established between statistics and DBpedia features, these features can be used as explanations for the
original statistics. For example, the quality of living in cities has been correlated with whether these cities are European
capitals. Explain-a-LOD leverages the different outputs of FeGeLOD [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], some of which corresponding to our
approach. For instance, the so called relation!s p e are paths, whereas the so called qualified relation!s p t, where e is
replaced by a class t instantiated by e, are path patterns. Alternatively, Vandewiele et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] propose to learn a decision
tree to classify entities based on paths of a knowledge graph. The authors suggest that the predictions of their model are
explainable as they are obtained by a “white-box” model (i.e., the decision tree) that combines interpretable features (i.e.,
paths from a knowledge graph). Interestingly, this system considers paths with their intermediate predicates and entities,
i.e., root!p1 e1 · · ·!pk ek. They allow a generalization of both predicates and entities by the use of a wildcard (*). In
our context, this would correspond to generalizing entities by the top level ontology class &gt;. However, unlike their
framework, we do not generalize predicates. In their study, they focus on paths of the form root! * * · ·!· * e, which
somewhat corresponds to extracting neighbors and their distance from seed vertices.
        </p>
        <p>
          Finally, path patterns are somewhat similar to generalized association rules [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and the concept of raising [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Indeed,
both works replace entities in rules by ontology classes to increase the support while preserving a high confidence.
Inspired by works that prune redundant generalized rules [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], our approach mines paths and path patterns that are
non-redundant and comply with some given constraints.
        </p>
        <p>Knowledge graph K
Set of seed vertices N
1. Canonicalizing K
2. Mining interesting
neighbors and types
3. Mining interesting
paths and path patterns</p>
        <p>4. Optional and
domain-dependent filtering</p>
        <p>Binary matrix
M 2 B|N|⇥|F|
In this paper, we consider a knowledge graph K and a set of seed vertices N = {n1, n2, . . . , np} ✓ V . The task is to
mine neighbors, paths, and path patterns from K that are associated with these seed vertices. For example, given a set
of drugs that cause or not a side effect, we aim to mine features that can later be used to classify these drugs.
In the following subsections, we propose algorithms to build a binary matrix M of size |N | ⇥ |F | from the knowledge
graph K and the set of seed vertices N . The set F consists of interesting neighbors, paths, and path patterns mined from
K, i.e., neighbors, paths, and path patterns that satisfy the constraints defined in terms of the parameters summarized in
Table 1. These parameters will be detailed in the following subsections. M associates a seed vertex n 2N with its
features f 2 F, i.e., if Mn,f = true, then n has feature f . We outline our approach in Figure 2 where steps 1, 2, and
3 are mandatory, while step 4 is optional and depends on the application domain.
3.1</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Canonicalizing K</title>
      <p>
        The first step of our approach consists in canonicalizing the knowledge graph K, i.e., unifying vertices that represent
the same real-world entity. We use the canonicalization word by analogy with the canonicalization of knowledge
bases, which consists in unifying equivalent individuals into one [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Indeed, in knowledge bases under the Open
Information Extraction paradigm, facts and entities can be represented by synonymous terms, which leads to co-existing
and equivalent individuals. For example, in such knowledge bases, two individuals Obama and Barack Obama can
co-exist. Similarly, in K, vertices can be connected through arcs labeled by the owl:sameAs predicate, indicating that
these vertices are actually representing the same real-world entity.
      </p>
      <p>Such a situation typically arises when K comprises several data sets. For example, a drug can be represented by two
vertices linked by an owl:sameAs arc, resulting from the information extraction of two independent drug-related
databases. Therefore, their merging allows an easy access to the full extent of the knowledge in K about the drug they
represent. Such a canonicalization process corresponds to edge contraction in graph theory (i.e., taking graph quotient).
In our framework, it reduces to contracting arcs whose label is the owl:sameAs predicate.</p>
      <p>To perform this canonicalization, we must respect the semantics associated with the owl:sameAs predicate, and thus
take into account its symmetry and transitivity. Indeed, an owl:sameAs arc between two vertices either is explicitly
stated in K or follows from existing arcs and these two properties. Let us consider a vertex v in K. The canonicalization
step merges v with all its identical vertices based on owl:sameAs arcs. To compute this set of vertices, it suffices
to compute the connected component of v in the undirected spanning subgraph formed by the owl:sameAs arcs of
K. Indeed, undirected edges comply with the symmetry of owl:sameAs and connected components comply with the
transitivity of owl:sameAs.</p>
      <p>As a result, this step takes K as input and outputs its canonical graph KC = ⌃ VC , ⌃ CA, V C , AC , sC , tC , `VC , `CA .
Similarly to K, KC is a directed labeled multigraph where
• V C is the set of canonical vertices.
• AC is the set of canonical arcs connecting canonical vertices through predicates.
• ⌃ VC is the set of canonical vertex labels.
• ⌃ CA is the set of canonical arc labels.
• sC : AC ! V C (respectively tC : AC ! V C ) associates a canonical arc to its canonical source (respectively
target) vertex.
• `VC : V C ! ⌃ VC (respectively `CA : AC ! ⌃ CA) maps a canonical vertex (respectively a canonical arc) to its
label.</p>
      <p>Each canonical vertex in KC represents a vertex from K and all its identical vertices. It is possible for a canonical vertex
in KC to only represent one vertex v from K if v has no identical vertices. This corresponds to creating a surjective
mapping : V ! V C associating a vertex from K to its equivalent canonical vertex in KC . Canonical arcs in KC are
constructed by using to map the source and target vertices of arcs in K to canonical vertices. Similarly, the set of seed
vertices N is mapped to the set of canonical seed vertices, denoted by N C .</p>
      <p>
        Remark 1. Note that storing URIs has a high memory footprint. Thus, in KC , we use indices in N instead of URIs to label
vertices and arcs, i.e., ⌃ VC ✓ N and ⌃ CA ✓ N. This leads to a reduced memory consumption in subsequent algorithms.
Each canonical vertex has one unique label, differing from labels of other canonical vertices, i.e., |⌃ VC | = |V C |. This
“relabeling” is inspired by the work of de Vries and de Rooij [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that use a structure named pathM ap to represent a
path by an integer. We developed our own structure for this relabeling, which we named CacheManager.
3.2
3.2.1
      </p>
      <p>Mining interesting neighbors and types</p>
      <p>Mining interesting neighbors
Here, we select all vertices that are neighbors of at least one seed vertex in N C by performing a breadth-first search
constrained by parameters k, d, u, bpredicates, and bexp-types. Neighbors are selected by traversing at most k arcs from the
seed vertices in N C . If u = false, then only outgoing arcs are traversed; otherwise, all arcs are traversed regardless of
their orientation.</p>
      <p>However, not all neighboring vertices are of interest. For example, we want to avoid provenance metadata vertices.
Indeed, they may not constitute discriminative features as they are specific to the vertex they describe. As we aim to
use ontology classes to generate path patterns, we also need to keep the graph exploration over the individuals of K
and avoid traversing rdf:type arcs. To this aim, we do not traverse arcs that are labeled by a predicate whose URI or
prefix of URI is blacklisted in bpredicates. For example, we blacklist in bpredicates the prefix of the provenance ontology
PROV-O5 and the URI of the rdf:type predicate6.</p>
      <p>Additionally, we provide a blacklist bexp-types of URIs or prefixes of classes whose instances must not be reached.
Hence, we do not reach individuals that instantiate directly or indirectly a blacklisted class, by following rdf:type
and rdfs:subClassOf arcs. For example, in a use case of classifying drugs that cause or not a side effect, one may
want to avoid neighbors that represent the side effect. That is why, the ontology class representing the side effect is
blacklisted in bexp-types.</p>
      <p>5http://www.w3.org/ns/prov#
6http://www.w3.org/1999/02/22-rdf-syntax-ns#type
When mining neighboring vertices, we may encounter vertices with a high degree, hereafter named hubs. If the
graph exploration considered their numerous neighbors, then the size of the selected neighborhood would increase
exponentially, thus causing a scalability issue. Additionally, hub neighbors may not constitute specific and discriminative
features. Indeed, if a hub can be reached from some seed vertices, i.e., appears in their neighborhood, the neighbors
of the hub will be reached by the same seed vertices. That is why, in our approach, we propose to stop the graph
exploration at vertices whose degree is strictly greater than parameter d7.</p>
      <p>Remark 2. If u = false, then the degree of a vertex only counts outgoing arcs, otherwise all arcs are counted. The
degree does not count arcs whose predicate is blacklisted in bpredicates. The degree counts arcs incident to a vertex that
instantiates a blacklisted class in bexp-types.</p>
      <p>As a result, with k, d, u, bpredicates, and bexp-types fixed, we obtain a set of neighboring vertices, denoted by N (N C ) ✓ V C .
Each neighboring vertex v 2 N(N C ) may only appear in the neighborhood of some seed vertices from N C w.r.t. the
parameters. Thus, v is associated with these seed vertices, which we indicate by defining the support set of v.
Definition 1 (Support set of a neighbor). For a given choice of these parameters, the support set of a neighboring
vertex v 2 N(N C ) is denoted by SUPPORTSET(v) ✓ N C and defined as the set of seed vertices from N C having v as
neighbor. The support of a neighbor is defined as the cardinal of its support set.</p>
      <p>Note that some vertices in N (N C ) are not very discriminative: when they are associated with very few vertices from
N C or nearly all of them. This motivates the use of parameters lmin and lmax that define the minimum and maximum
support for a neighbor to appear in the set F of features. Hence, a neighbor v 2 N(N C ) constitutes a feature in the
output matrix M if and only if lmin  | SUPPORTSET(v)|  lmax. We denote the set of interesting neighbors to appear
in F by</p>
      <p>Nl(N C ) = v | v 2 N(N C ) and lmin  | SUPPORTSET(v)|  lmax .</p>
      <p>In M, for nC 2N C and v 2 Nl(N C ), we have MnC,v = true if and only if nC 2SUPPORTSET(v).
Example 1. From KC in Figure 1 and with k = 3, d = 4, lmin = 2, lmax = 3, u = false, bpredicates =
{type, subClassOf}, and bexp-types = ; , we obtain:</p>
      <p>N (N C ) = {v1, v2, v3, v4, v5, v6, v8, v9} ;
SUPPORTSET(v1) = SUPPORTSET(v6) = n1C , n2C ;
SUPPORTSET(v2) = SUPPORTSET(v3) = SUPPORTSET(v8) = n1C ;
SUPPORTSET(v4) = SUPPORTSET(v5) = SUPPORTSET(v9) = n2C ;</p>
      <p>Nl(N C ) = {v1, v6} .</p>
      <p>The graph exploration stops at v1. Indeed, it is considered a hub as its degree is greater than d. Thus, vertices on the left
of Figure 1 are not explored. Because u = false, the graph exploration cannot reach v7. If bexp-types = {T3}, then v3
and v5 cannot be traversed, resulting in N (N C ) = {v1, v2, v4, v8, v9}.
3.2.2</p>
      <p>Mining interesting types
Observe that we can use N (N C ) to compute interesting types over the considered neighborhood. These interesting
types will alleviate a scalability issue arising when building path patterns in Subsection 3.3. Interesting types must be
computed over N (N C ) and not Nl(N C ) as vertices whose support is below lmin can instantiate interesting types. As
an intuitive example, in Figure 1, T3 is associated with both n1C (because of v3) and n2C (because of v5). For lmin = 2,
v3 and v5 will not be selected as features, however T3 can be used in path patterns.</p>
      <p>Parameters t and bgen-types constrain the ontology classes considered in the construction of path patterns, and thus they
are integrated in the computation of interesting types. Parameter t specifies the maximum level of considered classes
in ontology hierarchies. This level is computed by starting at vertices to generalize and following rdf:type and
rdfs:subClassOf arcs. t = 0 only allows to generalize vertices with &gt;, which is considered to be instantiated by
all vertices. For example, v3 can be generalized by T2 and &gt; if t = 1, and by T2, T3, and &gt; if t = 2. Additionally,
types used for generalization must not be blacklisted in bgen-types. This blacklist consists of URIs or prefixes of ontology
classes not to be used during the construction of path patterns. For example, we refrain from considering general classes
such as pgxo:Drug8.</p>
      <p>
        7From a similar assessment, de Vries and de Rooij [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] tackle the hub issue by removing edges based on frequency of pairs
(source, predicate) and (predicate, target).
      </p>
      <p>8http://pgxo.loria.fr/Drug
To compute interesting types, we must first compute their support set. This motivates the following predicate:
inst(v, T, t, bgen-types) =</p>
      <p>false otherwise
⇢true if v instantiates T under parameters t and bgen-types</p>
      <sec id="sec-2-1">
        <title>We can then define the support set of an ontology class T as follows:</title>
        <p>Definition 2 (Support set of an ontology class). The support set of an ontology class T is the union of support sets of
vertices v 2 N(N C ) that can be generalized by T under parameters t and bgen-types. Formally:</p>
        <p>SUPPORTSET(T ) =</p>
      </sec>
      <sec id="sec-2-2">
        <title>SUPPORTSET(v)</title>
        <p>[
v2N (NC)
inst(v,T,t,bgen-types)
Finally, the set of interesting types used to build path patterns is defined as T lmin = {T | lmin  | SUPPORTSET(T )|}.
Example 2. With the same parameters as in Example 1, we obtain</p>
        <p>SUPPORTSET(T2) =
n1C ;</p>
        <p>SUPPORTSET(T1) =
n1C , n2C ;</p>
        <p>T lmin = {T1, T3, T5, T6, &gt;} .
3.3</p>
        <p>Mining interesting paths and path patterns
This step focuses on mining interesting paths and path patterns rooted by seed vertices nC 2N C . We use the term path
feature (PF) to indicate a path or a path pattern.
! p1 T1! p2 T3 can be rooted by n1C and n2C and is of length 2.</p>
        <p>Definition 3 (Path feature). A path feature is a sequence of atomic elements that are pair!s E where p is a predicate
and E is either (i) an individual (for paths), or (ii) an individual or an ontology class (for path patterns). The length of a
path feature counts the number of its atomic elements.</p>
        <p>Example 3. In Figure 1, the path! p1 v2! p2 v3! p3 v6 can be rooted by n1C and is of length 3. The path pattern
p
Interesting path features are built by a breadth-first expansion starting at vertices in N C . As previously, k defines the
maximum number of arcs traversed. Hence, path features are of length 1 to k. Observe that a scalability issue arises
when mining interesting path features. Indeed, there may be several paths between two vertices and each vertex in a
path can be generalized by several ontology classes. For instance, for t = 2,! p1 v2! p2 v3! p3 v6 can be generalized
by up to 23 path patterns. We propose a mining procedure that alleviates the scalability issues associated with the
mining of path patterns. This mining procedure relies on the monotonicity of the support set of path features, which is
defined as follows:
Definition 4 (Support set of a path feature). The support set of a path consists of all vertices from N C that root it in
KC . Formally,</p>
        <p>SUPPORTSET(!pa va . . .! pb vb) = nnC</p>
        <p>2N C | nC!pa va . . .! pb vb exists in KC o .</p>
        <p>The support set of a path pattern consists of all vertices from N C that root a path in KC that is captured by the path
pattern. Formally,</p>
        <p>SUPPORTSET(!pa Ea . . .! pb Eb) = nnC</p>
        <sec id="sec-2-2-1">
          <title>2N C | nC!pa va . . .! pb vb exists in KC and</title>
          <p>o
8 vi, vi = Ei or inst(vi, Ei, t, bgen-types) .</p>
          <p>The support of a path feature is defined as the cardinal of its support set.</p>
          <p>Example 4. From Figure 1, we have SUPPORTSET(! p1 v2! p2 v3! p3 v6) =
n1C .</p>
          <p>Our approach of mining interesting path features is guided by the dependency structure as illustrated in Figure 3. At
first, this structure is empty and it is then augmented at each iteration of Algorithm 1, whose operations are described
and illustrated below.</p>
          <p>Remark 3. As introduced in Remark 1, there are some hacks to help mitigating some of the scalability drawbacks.
Storing and manipulating a path feature as a list of elements has a high memory footprint. Thus, such a list is only
stored once in our CacheManager structure and the returned index (from N) is used in mining algorithms.
Algorithm 1 Mining interesting paths and path patterns
Input: The canonical knowledge graph KC , the set of canonical seed vertices N C , the set of interesting types T lmin
Parameters: k, t, d, lmin, lmax, u, bpredicates, bexp-types, bgen-types
Output: F and M completed with interesting paths and path patterns
1: h 1
2: repeat
3: Expand paths in Ph
4: Generalize expanded paths into path patterns
5: Keep most specific path features
6: Select (i) generated path features to add to F (complete M accordingly), and (ii) paths to add to Ph+1
7: h h + 1
8: until h &gt; k or Ph = ;
3.3.1</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Expand paths in Ph</title>
      <p>Each path P 2 Ph9 is expanded with pairs! pe ve that are chosen in the neighborhood of the last individual of P . This
choice is constrained by parameters k, d, u, bpredicates, and bexp-types as in Subsection 3.210. In the first iteration, the
neighborhood of seed vertices in N C is used. If no path in Ph can be expanded, i.e., the neighborhood of their last
vertex does not contain reachable vertices under the constraints, then Algorithm 1 ends.</p>
      <p>Example 5. From the graph in Figure 1, the first expansion generates the following paths:! p4 v1,! p1 v2,! p1 v4,
! p6 v8, and! p6 v9. In the second expansion,! p4 v1 is not expanded as v1 is a hub under d = 4. Since their respective
neighborhood does not contain reachable vertices,! p6 v8 and! p6 v9 are also not expanded. The expansion of! p1 v2
generates! p1 v2! p2 v3 whereas the expansion of! p1 v4 generates! p1 v4! p2 v5.
3.3.2</p>
      <p>Generalize expanded paths into path patterns
Let Pe be the expansion of P 2 Ph as previously described, i.e., Pe = P! pe ve. We generalize Pe by:
• Generating patterns P! pe T with types T 2 T lmin for which the predicate inst(ve, T, t, bgen-types) is verified.
! pe ve, and! pe T for all T 2 T lmin for which the predicate inst(ve, T, t, bgen-types) is verified.</p>
      <p>• Retrieving from the dependency structure the path patterns that generalize P 11 and expanding them with</p>
      <sec id="sec-3-1">
        <title>Intuitively, this generalization operation allows to expand path patterns.</title>
        <p>which is expanded into! p1 T1! p2 v3,! p1 T1! p2 T3, and! p1 T1! &gt;</p>
        <sec id="sec-3-1-1">
          <title>Example 6. In the first iteration,! p1 v2 is generalized by! p1 T1 and! p&gt;1</title>
          <p>is not kept in the
dependency structure at the end of the first iteration (see Subsections 3.3.3 and 3.3.4). In the second iteration,! p1 v2
expands into! p1 v2! p2 v3. In the dependency structure, we retriepv2e! p1 T1 as the path pattern generalizing! p1 v2,
. As we will see,! p&gt;1
.</p>
          <p>We only generalize paths with types T 2 T lmin to avoid the generation an important number of uninteresting path
patterns that would then be discarded, thus reducing the memory footprint. Indeed, by definition, if T 62 T lmin , then
|SUPPORTSET(T )| &lt; lmin. Additionally, given a path feature P , |SUPPORTSET(P )|  minE2 P |SUPPORTSET(E)|,
where E can be a class or an individual involved in P . Thus, if T 62 T lmin is used in a path pattern P , we would have
|SUPPORTSET(P )| &lt; lmin, therefore generating an uninteresting path pattern that would be discarded later.
3.3.3</p>
          <p>
            Keep most specific path features
Inspired by works that prune redundant generalized rules during their generation [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ], we keep only the “most specific”
path patterns among those that have the same support set.
          </p>
          <p>9Ph is explained in Subsection 3.3.4.</p>
          <p>10Additionally, to avoid loops, P can only be expanded at iteration h with individuals ve such that there exists at least one seed
vertex in SUPPORTSET(P ) whose shortest distance to ve is h.</p>
          <p>11Not all generated path patterns remain in the dependency structure at the end of an iteration, see Subsections 3.3.3 and 3.3.4.
more specific than another atomic element!p2 E2 if and only if:
Definition 5 (More specific path pattern). A path pattern P1 is more specific than another path pattern P2 if every
atomic element of P1 is more specific than the atomic element of P2 at the same position. An atomic element!p1 E1 is
(i) p1 = p2, i.e., both atomic elements involve the same predicate12, and
(ii) E1 is more specific than E213.</p>
          <p>When path patterns have the same support set, keeping the most specific ones remove redundant generalizations, thus
reducing their number and the computational burden. Additionally, we ensure a high descriptive power because the
most specific paths are the most descriptive. Intuitively, a path pattern involving a class T is less descriptive than
another pattern involving a subclass or an instance of the class. However, keeping the most specific path patterns is
computationally expensive, which led us to propose the following computational procedure.</p>
          <p>We notice that the support set of a path pattern is the union of the support sets of the paths it generalizes. Therefore,
we discard path patterns that generalize only one path. Indeed, such path patterns have the same support set as their
original path and are more general, by definition.</p>
          <p>Example 7. For h = 3, we discard the following path patterns:!p1 v2!p2 v3!p&gt;3
and!p1 v4!p2 v5!p&gt;3</p>
          <p>However, there may exist path patterns that generalize several more specific path features while having the same support
set. Such path patterns should also be discarded.</p>
          <p>Example 8. For h = 1,!p&gt;1
discarded.</p>
          <p>shares the same support set as the more specific path pattern!p1 T1 and thus should be
To efficiently discard path patterns, we avoid computing their whole hierarchy. Instead, we focus on retaining only the
most specific ones in the prefix tree depicted in Figure 4. This prefix tree is incrementally augmented and stores the most
specific path patterns for a specific iteration and support set. In this tree, individuals/classes and predicates involved in
path patterns are indexed separately. Thus, its depth is twice the length of path patterns of the current iteration.
The prefix tree enables an efficient storage and selection of the most specific path patterns. Indeed, let P be a path
pattern to be compared with those already stored. A breadth-first traversal is performed to detect more specific patterns
than P : we only traverse identical or more specific elements according to Definition 5. At any depth, if such elements
cannot be found, then P is one of the most specific patterns and the traversal stops. On the contrary, if the traversal
reaches a leaf containing more specific elements, then there is a pattern more specific than P with the same support set.
Consequently, P is discarded and removed from the dependency structure.</p>
          <p>If P is to be stored, another breadth-first traversal is performed by considering identical or more general elements than
the ones in P . When the traversal reaches a leaf, it means that more general patterns than P are currently stored and
have the same support set. These are removed from the prefix tree before storing P . They are also removed from the
dependency structure.</p>
          <p>Remark 4. As the prefix tree relies on associative arrays and sets, the computational cost of traversal, insertion, and
removal is reduced. Additionally, resetting the tree at each expansion and each support set reduces the number of
patterns to traverse and thus the global computational cost.
3.3.4 Select generated path features to add to F , and paths to add to Ph+1
In this subsection, we determine (i) which paths and path patterns should be added as features in F , and consequently,
(ii) which paths to add to Ph+1, i.e., to expand during the next iteration.</p>
          <p>A path feature P can be added in F if:
(C1) lmin  SUPPORTSET(P )  lmax,
(C2) Prefixes of P with the same support set do not already exist in F ,
(C3) If P is a path pattern, it does not generalize a path with the same support set.</p>
          <p>When P is in F , the output binary matrix M verifies MnC,P = true for all nC 2SUPPORTSET(P ).
12It is noteworthy that we do not consider the hierarchy of predicates in this work.
13A class is more specific than all its super-classes and an individual is more specific than all classes it instantiates.</p>
          <p>Remark 5. Note that (C2) allows to focus on shorter paths in F . However, (C2) is not applied if P ends with an
individual and there exist a prefix of P in F that ends with a class. P is considered more descriptive than its prefix,
because of the individual in the last position. Hence, we add P in F and remove its prefix. For example, in Figure 3,
!p1 T1!p2 T3 is replaced by!p1 T1!p2 T3!p3 v6 in F for h = 3.</p>
          <p>Remark 6. (C3) is motivated as the path is more specific and thus more descriptive than P and should be added instead
of P .</p>
          <p>We also select the paths and path patterns to expand during the next iteration. To reduce their number, we rely on the lmin
constraint and the monotonicity of the support set. It is clear that, for a path feature P , we have |SUPPORTSET(P )| 
minE2 P |SUPPORTSET(E)|, where E can be a class or an individual involved in P . Thus, when expanding a path
feature, its support set remains identical (for paths and path patterns) or decreases (for path patterns).
Consequently, we add to Ph+1 paths whose expansion may generate path features complying with the lmin constraint,
i.e., paths with a support greater than lmin or paths that are generalized by a pattern with a support greater than lmin.
This monotonicity property also lets us remove from the dependency structure patterns whose support is lower than
lmin. Indeed, such patterns cannot be used during the generalization step of the next iteration since they will inevitably
generate a pattern whose support is smaller than lmin. As a result, the monotonicity property does entail a reduction in
the number of paths and path patterns considered in the next iteration, thus reducing the computational cost.
Example 9. For example, in the first iteration, the path!p1 v2 is added to P2 as it is generalized by!p1 T1 whose
support is greater than lmin = 2. At the end of the second iteration, we remove!p1 v2!p2 T3 because its support is
lower than lmin = 2, and thus its expansion cannot generate an interesting pattern.
3.4</p>
          <p>Optional and domain-dependent filtering
After the previous steps, we obtain a feature set F containing interesting neighbors, paths, and path patterns. These
features have been mined without taking into account domain constraints known to experts. We propose to apply
domain-dependent filtering on F with parameter m. Such filters reduce the size of F and integrate interestingness
constraints based on expert knowledge.</p>
          <p>Example 10. To classify drugs causing or not a side effect, experts may want to focus on features containing a
biological pathway, a gene or a GO class, or a MeSH class. Therefore, we propose three atomic filters, only keeping
neighbors, paths, and path patterns containing at least a pathway (m = p), a gene or a GO class (m = g), or a MeSH
class (m = m). Such atomic filters can be combined to form disjunctive filters. For example, the m = pg filter keeps
features from F containing at least a pathway or a gene or a GO class. When a filter is applied to a neighbor, this
neighbor must be, e.g., a pathway for the p filter. When a filter is applied to a path or a path pattern, it means that one of
its individuals / ontology classes must be, e.g., a pathway for the p filter.</p>
          <p>
            This domain-dependent filtering is similar to approaches that generalize association rules and prune those that involve
some specified ontology classes [
            <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
            ].
          </p>
          <p>Experimental setup
To illustrate our approach, we will address the following task: from a knowledge graph, mine a set of features to classify
drugs depending on whether they cause a specific side effect.</p>
          <p>
            We explore PGxLOD14 [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], a knowledge graph that aggregates several sets of Linked Open Data (LOD) describing
drugs, phenotypes, and genetic factors: PharmGKB, ClinVar, DrugBank, SIDER, DisGeNET, and CTD. This
aggregation may lead to features combining units from several LOD sets. Indeed, LOD sets may contain different and
incomplete knowledge. Their combined use then enables leveraging a greater amount of knowledge, where some LOD
sets complete information provided by others. This asks for a canonical knowledge graph as described in Subsection 3.1.
For instance, it is possible to complete the knowledge related to a drug described in PharmGKB if it is linked with an
owl:sameAs arc to the same drug described in DrugBank. This constitutes the key interest in combining LOD sets in
knowledge discovery and data mining tasks, as discussed by Ristoski and Paulheim [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ].
          </p>
          <p>
            We will use the following data sets that comprise positive ( ) and negative ( ) drug examples:
Data set 1 (Drug Induced Liver Injury (DILI) [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]). It is formed by 1,036 drugs in 4 classes: “most DILI concern”
(192 drugs), “ambiguous DILI concern” (254 drugs), “less DILI concern” (278 drugs), and “no DILI concern” (312
drugs). We mapped these drugs from their PubChem identifiers to identifiers from PharmGKB, otherwise DrugBank,
otherwise KEGG, resulting in the set of seed vertices N DILI = N DILI [ N DILI such that:
• |N DILI| = 146 drugs (118 from PharmGKB, 17 from DrugBank, and 11 from KEGG). The positive drug
examples are from the “most DILI concern” class.
• |N DILI| = 224 drugs (206 from PharmGKB, 9 from DrugBank, and 9 from KEGG). The negative drug
examples are from the “no DILI concern” class.
          </p>
          <p>Data set 2 (Severe Cutaneous Adverse Reactions (SCAR)15). It is formed by 874 drugs in 5 classes: “very probable”
(18 drugs), “probable” (19 drugs), “possible” (94 drugs), “unlikely” (697 drugs), and “very unlikely” (46 drugs). We
mapped these drugs from their PubChem identifiers to identifiers from PharmGKB, otherwise DrugBank, otherwise
KEGG, resulting in the set of seed vertices N SCAR = N SCAR [ N SCAR such that:
• |N SCAR| = 102 drugs (100 from PharmGKB and 2 from DrugBank). The positive drug examples are from the
“very probable”, “probable”, and “possible” classes.
• |N SCAR| = 290 drugs (286 from PharmGKB and 4 from DrugBank). The negative drug examples are from
the “unlikely” and “very unlikely” classes.</p>
          <p>We implemented our approach in Python16. We used a server with 700 GB of RAM and the following parameter values
k 2 {1, 2, 3, 4}, t 2 {1, 2, 3}, d = 500, u = false, lmin = 5, lmax = +1 , and m 2 {p, g, m, pg, pgm}. It should be
noted that k = 4 was only tested with t = 1 because of memory issues caused by the high number of generated features.
Statistics about the features are detailed for k = 3, t = 3 and k = 4, t = 1 in Table 2 and discussed in the next section.
We obtained the features associated with the DILI data set under k = 3, t = 3 in approximately 1 hour. However,
computing the features with k = 4, t = 1 on the same data set required 4 days and 380 GB of RAM.
5</p>
          <p>Results and discussion
The first two lines of Table 2 show the number of neighbors and types reachable before applying support limits.
Enforcing these limits constitutes a first reduction of these numbers, thus reducing the memory and computational
footprints. Here, numbers are approximately divided by 2. The mining of path features always relies on lmin, and
thus it is not possible to count the number of all possible paths and path patterns. However, we show the number of
path features generated during the mining, which already illustrates the combinatorial explosion. Enforcing support
constraints and removing redundant generalizations allow to reduce their number in F (here, approximately by 20).
Finally, the domain-dependent filtering defined by m also radically scales down the number of features ultimately
output. However, this filtering only happens as post-processing and does not alleviate the scalability issues arising
during the mining of patterns.</p>
          <p>We observe a drastic increase in the number of neighbors and path features alongside k, which highlights the scalability
issues of mining large knowledge graphs. Considering additional levels in ontology hierarchies by increasing t also
14https://pgxlod.loria.fr
15http://www.regiscar.org/
16https://github.com/pmonnin/kgpm
multiplies the number of path features. To illustrate, with 700 GB of RAM, we could not set t to values greater than
1 for k = 4. Consequently, there is still room for improvements in terms of memory consumption, e.g., through an
efficient storing of paths and path patterns. These future improvements could enable to consider the full neighborhood
of seed vertices, which is reached for greater values of k and t and involves a far greater amount of vertices. For
example, for the DILI data set, the full neighborhood is reached for k = 23 and t = 21 (for d = 500), or k = 19 and
t = 21 (when the degree constraint is disabled). The full amount of reachable neighbors is 4 times (d = 500) or 9 times
greater (d = +1 ) than with k = 4.</p>
          <p>
            The values of parameters depend on the objectives and domain knowledge of the analyst guiding the mining process,
especially for blacklists and support thresholds. However, metrics about the knowledge graph may provide guidance.
Indeed, statistics about node degrees can help to find a trade-off between exploration and combinatorial explosion with
parameter d. Similarly, the depth of class hierarchies influences the value of t. For example, general classes may not
be of interest to the analyst, thus reducing t. The parameter k can be set by considering the graph diameter. As it is
common in mining processes, iterations may be required to find the best configuration. Regarding m, it is for now
hard-coded and only suitable to some biomedical applications. Inspired by ontologies that allow to interactively define
mining workflows [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we could adapt this parameter to other applications by proposing such an interactive definition.
When manually reviewing the output features, we noticed multiple path features across the aggregated LOD sets. This
is made possible by aggregating and canonicalizing multiple LOD sets in the knowledge graph. This result particularly
illustrates one of the fundamental aspects of Linked Open Data: the combination of different data sets enables to go
beyond their original purposes and coverage. However, it is clear that combining LOD sets leads to bigger knowledge
graphs, exacerbating the scalability issues.
          </p>
          <p>
            Regarding our approach, we only canonicalize vertices, i.e., individuals and ontology classes. Nevertheless, predicates
used on arcs can also be identified as identical, leading to a canonicalization of arcs. In this context, we could benefit
from matching approaches, such as PARIS [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ]. By identifying identical classes, predicates, and individuals, these
matching approaches could further improve the canonicalization and, therefore, increase the number of common
features between seed vertices from different data sets. Similarly, we could consider literals and arcs incident to literals
that were purposely discarded here. However, the canonicalization of literals raises several challenging issues due
to their heterogeneity in terms of syntactic variations, unit measures, and the precision of numerical values. Other
reasoning mechanisms and semantics associated with Semantic Web standards could be taken into account. For example,
predicates can be defined as transitive, and thus the canonical knowledge graph could also result from their transitive
closure.
          </p>
          <p>
            Regarding the modeling of path patterns, we could generalize paths with both the hierarchy of classes and the hierarchy
of predicates. In addition to keeping the most specific patterns, we could use other metrics to further reduce their
redundancy (e.g., approaches relying on hierarchies [
            <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
            ] and extents of ontological classes [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]). This could also
reduce the number of generated patterns, therefore improving the scalability of the mining approach. Neighbors could
be enriched with the distance between them and seed vertices, which would correspond to the generalized paths of
KGPTree [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. We could also use other approaches than binary features (e.g., counting [
            <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
            ], relative counting [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]).
More importantly, it remains to test our mined features within a complete classification task to measure the influence of
k, t, and the three kinds of features (neighbors, paths, and path patterns).
6
          </p>
          <p>Conclusion
In this preliminary study, we addressed the scalability issues associated with the task of mining neighbors, paths, and
path patterns from a knowledge graph and a set of seed vertices. We proposed a method for tackling these issues,
which we illustrated by mining a real-world knowledge graph. Our results highlight the importance of considering the
scalability of approaches when mining features from ever-growing knowledge graphs. Our work alleviates part of the
computational cost (time and memory) of mining paths and path patterns but also reveals the need for a further reduction.
Such future research works could enable the modeling of more complex path patterns, for example, considering the
hierarchy of predicates.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Aidan</given-names>
            <surname>Hogan</surname>
          </string-name>
          , Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Neumaier</surname>
          </string-name>
          , Axel Polleres, Roberto Navigli,
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabbir M. Rashid</surname>
            , Anisa Rula, Lukas Schmelzeisen,
            <given-names>Juan F.</given-names>
          </string-name>
          <string-name>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <surname>Steffen Staab</surname>
            , and
            <given-names>Antoine</given-names>
          </string-name>
          <string-name>
            <surname>Zimmermann</surname>
          </string-name>
          .
          <article-title>Knowledge graphs</article-title>
          .
          <source>CoRR</source>
          , abs/
          <year>2003</year>
          .02320,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Tom Heath, and
          <string-name>
            <surname>Tim</surname>
          </string-name>
          Berners-Lee.
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Petar</given-names>
            <surname>Ristoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Semantic web in data mining and knowledge discovery: A comprehensive survey</article-title>
          .
          <source>J. Web Semant</source>
          .,
          <volume>36</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Luis</given-names>
            <surname>Galárraga</surname>
          </string-name>
          , Geremy Heitz, Kevin Murphy, and
          <string-name>
            <surname>Fabian</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Suchanek</surname>
          </string-name>
          .
          <article-title>Canonicalizing open knowledge bases</article-title>
          .
          <source>In Jianzhong Li, Xiaoyang Sean Wang</source>
          ,
          <string-name>
            <given-names>Minos N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          , Ian Soboroff, Torsten Suel, and Min Wang, editors,
          <source>Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          ,
          <string-name>
            <surname>CIKM</surname>
          </string-name>
          <year>2014</year>
          , Shanghai, China, November 3-
          <issue>7</issue>
          ,
          <year>2014</year>
          , pages
          <fpage>1679</fpage>
          -
          <lpage>1688</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Generating possible interpretations for statistics from linked open data</article-title>
          .
          <source>In Elena Simperl</source>
          , Philipp Cimiano, Axel Polleres, Óscar Corcho, and Valentina Presutti, editors,
          <source>The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC</source>
          <year>2012</year>
          , Heraklion, Crete, Greece, May
          <volume>27</volume>
          -31,
          <year>2012</year>
          . Proceedings, volume
          <volume>7295</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>560</fpage>
          -
          <lpage>574</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Baoxu</given-names>
            <surname>Shi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Weninger</surname>
          </string-name>
          .
          <article-title>Discriminative predicate path mining for fact checking in knowledge graphs</article-title>
          .
          <source>Knowl.-Based Syst.</source>
          ,
          <volume>104</volume>
          :
          <fpage>123</fpage>
          -
          <lpage>133</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ora</given-names>
            <surname>Lassila</surname>
          </string-name>
          , et al.
          <source>The Semantic Web. Scientific American</source>
          ,
          <volume>284</volume>
          (
          <issue>5</issue>
          ):
          <fpage>28</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Thomas</surname>
            <given-names>R</given-names>
          </string-name>
          <string-name>
            <surname>Gruber.</surname>
          </string-name>
          <article-title>A translation approach to portable ontology specifications</article-title>
          .
          <source>Knowledge Acquisition</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>199</fpage>
          -
          <lpage>220</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Charu</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Aggarwal</surname>
          </string-name>
          and Haixun Wang, editors.
          <source>Managing and Mining Graph Data</source>
          , volume
          <volume>40</volume>
          <source>of Advances in Database Systems</source>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pierre</surname>
            <given-names>Monnin</given-names>
          </string-name>
          , Joël Legrand, Graziella Husson, Patrice Ringot, Andon Tchechmedjiev, Clément Jonquet, Amedeo Napoli, and Adrien Coulet.
          <article-title>PGxO and PGxLOD: a reconciliation of pharmacogenomic knowledge of various provenances, enabling further comparison</article-title>
          .
          <source>BMC Bioinformatics</source>
          , 20-S(4):
          <volume>139</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>139</lpage>
          :
          <fpage>16</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Guillaume</surname>
            <given-names>Bonfante</given-names>
          </string-name>
          , Bruno Guillaume, and
          <string-name>
            <given-names>Guy</given-names>
            <surname>Perrier</surname>
          </string-name>
          .
          <article-title>Application of Graph Rewriting to Natural Language Processing</article-title>
          . Wiley Online Library,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Pablo</surname>
            <given-names>Barceló</given-names>
          </string-name>
          , Leonid Libkin,
          <string-name>
            <given-names>and Juan L.</given-names>
            <surname>Reutter</surname>
          </string-name>
          .
          <article-title>Querying graph patterns</article-title>
          . In Maurizio Lenzerini and Thomas Schwentick, editors,
          <source>Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS</source>
          <year>2011</year>
          , June 12-16,
          <year>2011</year>
          , Athens, Greece, pages
          <fpage>199</fpage>
          -
          <lpage>210</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Gerben</given-names>
            <surname>Klaas Dirk de Vries and Steven de Rooij</surname>
          </string-name>
          .
          <article-title>A fast and simple graph kernel for RDF</article-title>
          .
          <source>In Claudia d'Amato</source>
          , Petr Berka, Vojtech Svátek, and Krzysztof Wecel, editors,
          <source>Proceedings of the International Workshop on Data Mining on Linked Data, with Linked Data Mining Challenge collocated with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD</source>
          <year>2013</year>
          ), Prague, Czech Republic,
          <year>September 23</year>
          ,
          <year>2013</year>
          ., volume
          <volume>1082</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Gerben</given-names>
            <surname>Klaas Dirk de Vries</surname>
          </string-name>
          and Steven de Rooij.
          <article-title>Substructure counting graph kernels for machine learning from RDF data</article-title>
          .
          <source>J. Web Semant</source>
          .,
          <volume>35</volume>
          :
          <fpage>71</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and
          <string-name>
            <surname>Fabian</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Suchanek</surname>
          </string-name>
          . AMIE:
          <article-title>association rule mining under incomplete evidence in ontological knowledge bases</article-title>
          . In Daniel Schwabe,
          <string-name>
            <given-names>Virgílio A. F.</given-names>
            <surname>Almeida</surname>
          </string-name>
          , Hartmut Glaser, Ricardo A.
          <string-name>
            <surname>Baeza-Yates</surname>
          </string-name>
          , and Sue B. Moon, editors,
          <source>22nd International World Wide Web Conference, WWW '13</source>
          , Rio de Janeiro, Brazil, May
          <volume>13</volume>
          -17,
          <year>2013</year>
          , pages
          <fpage>413</fpage>
          -
          <lpage>422</lpage>
          . International World Wide Web Conferences Steering Committee / ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Josua</given-names>
            <surname>Stadelmaier</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Padó</surname>
          </string-name>
          .
          <article-title>Modeling paths for explainable knowledge base completion</article-title>
          .
          <source>In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>157</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          .
          <article-title>Unsupervised generation of data mining features from linked open data</article-title>
          .
          <source>In Dumitru Dan Burdescu</source>
          , Rajendra Akerkar, and Costin Badica, editors,
          <source>2nd International Conference on Web Intelligence</source>
          , Mining and Semantics, WIMS '12,
          <string-name>
            <surname>Craiova</surname>
          </string-name>
          , Romania, June 6-8,
          <year>2012</year>
          , pages
          <fpage>31</fpage>
          :
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          :
          <fpage>12</fpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Gilles</surname>
            <given-names>Vandewiele</given-names>
          </string-name>
          , Bram Steenwinckel, Femke Ongenae, and Filip De Turck.
          <article-title>Inducing a decision tree with discriminative paths to classify entities in a knowledge graph</article-title>
          .
          <source>In Zhe He</source>
          , Jiang Bian, Cui Tao, and Rui Zhang, editors,
          <source>Proceedings of the 4th International Workshop on Semantics-Powered Data Mining and Analytics colocated with the 18th International Semantic Web Conference (ISWC</source>
          <year>2019</year>
          ), Aukland, New Zealand,
          <source>October</source>
          <volume>27</volume>
          ,
          <year>2019</year>
          ., volume
          <volume>2427</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Ramakrishnan</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rakesh</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Mining generalized association rules</article-title>
          . In Umeshwar Dayal,
          <string-name>
            <surname>Peter M. D. Gray</surname>
          </string-name>
          , and Shojiro Nishio, editors,
          <source>VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15</source>
          ,
          <year>1995</year>
          , Zurich, Switzerland., pages
          <fpage>407</fpage>
          -
          <lpage>419</lpage>
          . Morgan Kaufmann,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Geller</surname>
          </string-name>
          .
          <article-title>Raising, to enhance rule mining in web marketing with the use of an ontology</article-title>
          .
          <source>In Data Mining with Ontologies: Implementations, Findings, and Frameworks</source>
          , pages
          <fpage>18</fpage>
          -
          <lpage>36</lpage>
          . IGI Global,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <article-title>Marcos Aurélio Domingues and Solange Oliveira Rezende</article-title>
          .
          <article-title>Using taxonomies to facilitate the analysis of the association rules</article-title>
          .
          <source>CoRR, abs/1112.1734</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Claudia</given-names>
            <surname>Marinica</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fabrice</given-names>
            <surname>Guillet</surname>
          </string-name>
          .
          <article-title>Knowledge-based interactive postmining of association rules using ontologies</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>22</volume>
          (
          <issue>6</issue>
          ):
          <fpage>784</fpage>
          -
          <lpage>797</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Petar</given-names>
            <surname>Ristoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Rdf2vec: RDF graph embeddings for data mining</article-title>
          . In Paul T. Groth, Elena Simperl,
          <string-name>
            <given-names>Alasdair J. G.</given-names>
            <surname>Gray</surname>
          </string-name>
          , Marta Sabou, Markus Krötzsch, Freddy Lécué, Fabian Flöck, and Yolanda Gil, editors,
          <source>The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference</source>
          , Kobe, Japan,
          <source>October 17-21</source>
          ,
          <year>2016</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>I</given-names>
          </string-name>
          , volume
          <volume>9981</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>498</fpage>
          -
          <lpage>514</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Minjun</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Ayako Suzuki, Shraddha Thakkar, Ke Yu,
          <string-name>
            <given-names>Chuchu</given-names>
            <surname>Hu</surname>
          </string-name>
          , and Weida Tong.
          <article-title>DILIrank: the largest reference drug list ranked by the risk for developing drug-induced liver injury in humans</article-title>
          .
          <source>Drug Discov Today</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          ):
          <fpage>648</fpage>
          -
          <lpage>653</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Fabian</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Suchanek</surname>
          </string-name>
          , Serge Abiteboul, and Pierre Senellart. PARIS:
          <article-title>probabilistic alignment of relations, instances, and schema</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>157</fpage>
          -
          <lpage>168</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Petar</given-names>
            <surname>Ristoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Feature selection in hierarchical feature spaces</article-title>
          . In Saso Dzeroski, Pance Panov, Dragi Kocev, and Ljupco Todorovski, editors,
          <source>Discovery Science - 17th International Conference, DS</source>
          <year>2014</year>
          , Bled, Slovenia, October 8-
          <issue>10</issue>
          ,
          <year>2014</year>
          . Proceedings, volume
          <volume>8777</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>288</fpage>
          -
          <lpage>300</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Claudia d'Amato</surname>
            ,
            <given-names>Steffen</given-names>
          </string-name>
          <string-name>
            <surname>Staab</surname>
            , and
            <given-names>Nicola</given-names>
          </string-name>
          <string-name>
            <surname>Fanizzi</surname>
          </string-name>
          .
          <article-title>On the influence of description logics ontologies on conceptual similarity</article-title>
          .
          <source>In Aldo Gangemi and Jérôme Euzenat</source>
          , editors,
          <source>Knowledge Engineering: Practice and Patterns</source>
          , 16th International Conference, EKAW 2008, Acitrezza, Italy,
          <source>September 29 - October 2</source>
          ,
          <year>2008</year>
          . Proceedings, volume
          <volume>5268</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>48</fpage>
          -
          <lpage>63</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Petar</given-names>
            <surname>Ristoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>A comparison of propositionalization strategies for creating features from linked open data</article-title>
          .
          <source>In Ilaria Tiddi</source>
          , Mathieu d'Aquin, and Nicolas Jay, editors,
          <source>Proceedings of the 1st Workshop on Linked Data for Knowledge Discovery co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD</source>
          <year>2014</year>
          ), Nancy, France,
          <year>September 19th</year>
          ,
          <year>2014</year>
          ., volume
          <volume>1232</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>