Mining the Web of Data with Metaqueries

                                   Francesca A. Lisi

                           Dipartimento di Informatica &
             Centro Interdipartimentale di Logica e Applicazioni (CILA)
                  Università degli Studi di Bari “Aldo Moro”, Italy
                       FrancescaAlessandra.Lisi@uniba.it


       Abstract. The Web of Data uses the World Wide Web (WWW) in-
       frastructure to represent and interrelate data sources. These sources are
       referred to as knowledge graphs (KGs) and correspond to huge collections
       of facts in the form of RDF triples. The analysis of data contained in
       a KG is propaedeutic to several crucial KG curation tasks, notably the
       automated completion of the graph, which pose several challenges due to
       the open and distributed environment of the WWW infrastructure. How-
       ever, KG mining could take advantage of some useful meta-information
       about the data to be analyzed, for instance, the schema of the KG when
       available. In this paper, we resort to the notion of a metaquery, proposed
       in the 90s as a template for patterns one is interested to discover in a re-
       lational database. We propose to extend this notion to the novel context
       of the Web of Data, in particular to the case of KG mining. A distin-
       guishing feature of metaquerying problems is the use of a second-order
       logic language. In this paper, we present a metaquery language based on
       second-order Description Logics but implementable with standard tech-
       nologies underlying the Web of Data, and briefly describe mechanisms
       for answering such metaqueries in the context of interest.

       Keywords: Metaquerying · Knowledge Graphs · Rule Mining.


1     Background and Motivation

The current vision of the World Wide Web (WWW) is that of a Web of Data
(WoD), which highlights the central role of data in the WWW. More precisely,
the WoD builds upon the WWW infrastructure to represent and interrelate
data (aka Linked Data), with the aim of transforming the Web from a dis-
tributed file system into a distributed database system. The foundational stan-
dards of the WoD include the Uniform Resource Identifier (URI) and the Re-
source Description Framework (RDF)1 . The former is used to identify resources
whereas the latter is used to relate resources. In particular, RDF can be con-
sidered as a data model according to which data is represented in the form of
triples hsubject predicate objecti. Huge collections of these triples can be then
organized into directed, labeled graphs known as knowledge graphs (KGs). A
1
    https://www.w3.org/RDF/
                                F.A. Lisi


             Fig. 1. Fragment of a knowledge graph (taken from [15]).


typical case of a large KG is DBPedia,2 which, essentially, makes the content
of Wikipedia 3 available in RDF and incorporates links to other datasets on the
Web, e.g., to Geonames 4 . An interesting point for the ILP community is that
RDF triples can be straightforwardly represented by means of unary and bi-
nary first-order logic (FOL) predicates. More precisely, the unary predicates are
the objects of the RDF type predicate, while the binary ones correspond to all
other RDF predicates, e.g., halice type researcheri and hbob isM arriedT o alicei
from the KG in Fig. 1 refer to researcher(alice) and isM arriedT o(bob, alice)
respectively. KGs can be accessed by posing queries with the RDF query lan-
guage SPARQL5 . Several entailment regimes are available for query answering
in SPARQL which are based on the existing link between RDF and the family
of Description Logics (DLs) [1].
    A distinguishing feature of KGs is their inherent incompleteness. Indeed, they
are constructed by automatically extracting available information from existing
Web information sources (see, e.g., the above mentioned case of DBPedia). The
curation of KGs is cumbersome due to their huge size. In particular, a major
activity aims at the completion of the KG in hand, in order to address the
issue of incompleteness. For instance, in link prediction, relational data mining
algorithms can be exploited to automatically build rules able to make predictions
on missing links. For example, the following rule

              isM arriedT o(X, Y ), livesIn(X, Z) ⇒ livesIn(Y, Z)             (1)

can be mined from the KG in Fig. 1 and applied to derive new facts such as
livesIn(alice, berlin), livesIn(dave, chicago) and livesIn(lucy, amsterdam) to
be used for completing the graph. However, the extension of relational data
2
  http://wiki.dbpedia.org/
3
  https://www.wikipedia.org/
4
  http://www.geonames.org/
5
  https://www.w3.org/TR/sparql11-overview/
                                    Mining the Web of Data with Metaqueries

mining to the WoD context is not straightforward. Indeed, being intrinsically in-
complete, KGs are naturally treated under the Open World Assumption (OWA)
as opposed to databases for which the Closed World Assumption (CWA) holds.
Nevertheless, KG mining algorithms could take advantage of some useful meta-
information about the KG in hand, e.g., domains, ranges and confidence values
of relations inside the KG (i.e., its schema). In this paper we resort to the notion
of a metaquery which was proposed in the 90s as a template that describes the
type of pattern one is interested to discover in a relational database [14,2,3]. A
common feature to metaquerying problems is the use of a second-order logic lan-
guage. For KG mining we devise a metaquery language based on second-order
DLs. A first step towards this direction of research was taken in [11]. In the
present paper we provide a more detailed description of the language (which
was only sketched in [11]) and a preliminary analysis of the necessary steps for
answering metaqueries in the proposed language.
    The rest of the paper is structured as follows. Section 2 presents syntax and
semantics of the metaquery language. Section 3 briefly describes mechanisms for
answering metaqueries. Section 4 concludes the paper with final remarks and
directions of future work.


2     A Metaquery Language for the Web of Data
In our proposal for the WoD context, a metaquery is a second-order DL con-
junctive query. In the following we gently introduce the reader to this notion.

Syntax Let L be a DL language with syntax (NC , NR , NO ) where NC , NR ,
and NO are the alphabet of concept, role, and individual names, respectively.
Note that concepts and roles in the DL terminology correspond to classes and
properties in RDF.
    First, we consider the extension of L so that we can enable the formulation
of conjunctive queries (which go beyond the standard way of querying a DL
knowledge base). For this purpose, let VO be a countably infinite set of individual
variables disjoint from NC , NR , and NO . A term t is an element from VO ∪ NO .
Let c be a concept, r a role, and t, t′ terms.6 An atom is an expression which
can take three different forms: c(t), r(t, t′ ), or t ≈ t′ . We refer to these three
kinds of atoms as concept atoms, role atoms, and equality atoms respectively. A
conjunctive query (CQ) of arity n is an expression of the form

                            q(X1 , . . . , Xn ) ← a1 , . . . , am                (2)

where q, called the query predicate, does not belong to NC ∪ NR ∪ NO ∪ VO , every
Xi belongs to VO , every aj is a (possibly non-ground) atom, and all variables Xi
occur in some aj . The variables Xi are called the free variables (aka distinguished
variables) of the query, whereas the other variables appearing in a1 , . . . , am are
6
    In DLs concept and role names are usually capitalized. However, for the sake of
    clarity, here we shall use capital letters only for variables.
                                 F.A. Lisi

called existential variables. A CQ is called Boolean if it has no free variable. An
example of CQ is the following

          q(Y, Z) ← isM arriedT o(X, Y ), livesIn(X, Z), livesIn(Y, Z)          (3)

where Y and Z are the free variables, X is the existential one, and isM arriedT o(X, Y ),
livesIn(X, Z), livesIn(Y, Z) are role atoms.
    Since we are interested in second-order CQs, we need to introduce two further
sets of variables (of second-order this time): VC of so-called concept variables, i.e.
variables that can be quantified over NC , and VR of so-called role variables, i.e.
variables that can be quantified over NR . Let then MQ(L) be the second-order
DL language obtained by extending L with VC and VR . For the purpose of this
work, we can restrict MQ(L) to particular (second-order) CQs, e.g., involving
only role variables and individual variables. An example of one such metaquery
is the following

                    M Q1 : mq(Q, Y, Z) ← P (X, Y ), Q(X, Z)                     (4)

which looks for the properties (Q) holding for the individuals Y . Note that
P, Q ∈ VR whereas X, Y, Z ∈ VO Metaqueries are the starting point for the
definition of so-called metaquery extensions, i.e., implications of the form

                                  M Q1 → M Q2                                   (5)

which are actually a compact representation of two metaqueries, M Q1 and M Q2 ,
where M Q2 is longer than - we say extends - M Q1 . A shorter notation for (5)
is the following which stresses how M Q2 extends M Q1

                             M Q1 ⇒ (M Q2 \ M Q1 )                              (6)

The left-hand side and the right-hand side of (6) are called the body and the
head of the metaquery extension, respectively. Note that in the case of query
extensions, the head does not correspond to the conclusion (as with clauses).
Following the standard terminology, one should rather bear in mind the un-
shortened notation, and call M Q2 the conclusion of the metaquery extension.
For instance, let us consider the following metaquery

               M Q2 : mq(Q, Y, Z) ← P (X, Y ), Q(X, Z), Q(Y, Z)                 (7)

which looks for the properties (Q) holding for the individuals Y and shared with
the individuals X to which Y is related by some P . From (4) and (7) we can
build a metaquery extension as shown below

                          P (X, Y ), Q(X, Z) ⇒ Q(Y, Z)                          (8)

   Metaquery extensions serve as a template for rules we are interested in when
applying rule mining algorithms to a given KG.
                                   Mining the Web of Data with Metaqueries

Semantics As for the semantics of MQ(L), we plan to follow the Henkin style
[9] for the following reasons. As opposed to the Standard Semantics, in the
Henkin semantics the expressive power of the language actually remains first-
order. This is a desirable feature because it paves the way for the use of first-order
solvers in spite of the second-order syntax. Also, this is a shared feature with
RDF(S). Last, but not least, it makes possible an implementation with SPARQL.
    A few remarks are necessary here about the use of the symbol ⇒ in (1) and
(8). Differently from ←, it does not represent the logical implication. However, as
discussed in Sect. 3, it can be treated as such in contexts like the aforementioned
link prediction problem, provided that an appropriate choice of rule evaluation
measures is done.


3    Answering Metaqueries in the Web of Data

As said in the previous section, metaquery extensions are nothing but a com-
pact representation of two metaqueries. Therefore in this section we limit our
discussion to metaqueries.
    The process of answering a metaquery can be divided into two stages. In the
first stage, which we call the instantiation stage, we look for sets of concepts
and roles that match the pattern determined by the metaquery. In the second
stage, which we call the filtration stage, we filter out all the rules that match the
pattern of the metaquery but do not satisfy some predefined evaluation criteria.


Instantiation Instantiating a metaquery is similar to solving a Constraint Sat-
isfaction Problem (CSP) where one is interested in finding all solutions of the
CSP problem. The instantiation step is practically possible when the schema
of the KG in hand is available, which is not the case for every KG. Indeed the
schema provides the signature of the relations occuring in the KG, thus making
this step an informed search rather than a blind search. For instance, with ref-
erence to the KG depicted in Fig. 1, (1) is an instantiation of (8) obtained by
substituting the role variables P and Q with the role names isM arriedT o and
livesIn, respectively.


Filtration At this stage the rules like (1) obtained by instantiating the given
metaquery extension are evaluated according to some interestingness measures.
For instance, we could aim at filtering out rules with low support and confidence
values. In this case it is reasonable to compute confidence only for rules with
sufficient support.
    Following [7], the (absolute) support of a CQ Q in a KG G is the number of
distinct tuples in the answer of Q on G. The relative support of h(X , Y ) over G
is defined as follows:
                                     #(X, Y ) : h(X, Y ) ∈ G
    supp(h(X, Y ), G) =                                                           (9)
                          (#X : ∃Y h(X, Y ) ∈ G) ∗ (#Y : ∃X h(X, Y ) ∈ G)
                                  F.A. Lisi

The confidence of a rule w.r.t. G can be defined starting from (9). However, in
order to estimate the actual implication of the rule at hand, we could exploit
the rule evaluation measure called conviction [4]. This choice is thus particularly
attractive for the KG completion task.


4   Conclusions and Future Work

In this paper we have briefly presented a new approach to mining the Web of
Data. The approach adapts the notion of metaquery introduced by [14] for rela-
tional data mining to the novel context of KG mining. The idea of considering
extensions of metaqueries is inspired by [7] in their seminal work on ILP for as-
sociation rule mining. However, differently from [14,7], our proposed metaquery
language is based on second-order DLs but can be implemented with standard
technologies of the Web of Data. The importance of metamodeling (of which
metaquerying is a special case) in several applications has been recently recog-
nized in the DL community. In particular, De Giacomo et al. [6] augment a DL
with variables that may be interpreted - in a Henkin semantics - as individuals,
concepts, and roles at the same time, obtaining a new logic Hi(DL). Colucci et
al. [5] introduce second-order features in DLs under Henkin semantics for mod-
eling several forms of non-standard reasoning. Lisi [10,12] extends [5] to some
variants of concept learning, thus being the first to propose higher-order DLs as
a means for metamodeling in data mining.
    In the KG community approaches for link prediction are divided into statistics-
based (see [13] for an overview), and logic-based (e.g., [8,15]), which are the clos-
est to our work. The latter basically extend and adapt previous work in ILP on
relational association rule mining. However, they differ in the expressiveness of
the mined rules. AMIE+ [8] can mine only Horn rules, whereas the methodology
described in [15] can deal with the case of nonmonotonic rules.
    In the future, several aspects of the proposed approach should be clarified
before an implementation. First, we need to better define the semantics for the
proposed metaquery language, also concerning the link with SPARQL. Second,
we need to design algorithms for the instantiation stage and choose the most
appropriate evaluation measures for the intended application.


References
 1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.):
    The Description Logic Handbook: Theory, Implementation and Applications (2nd
    ed.). Cambridge University Press (2007)
 2. Ben-Eliyahu-Zohary, R., Gudes, E.: Towards efficient metaquerying. In: Dean, T.
    (ed.) Proceedings of the Sixteenth International Joint Conference on Artificial In-
    telligence, IJCAI 99, Stockholm, Sweden, July 31 - August 6, 1999. 2 Volumes,
    1450 pages. pp. 800–805. Morgan Kaufmann (1999)
 3. Ben-Eliyahu-Zohary, R., Gudes, E., Ianni, G.: Metaqueries: Semantics, complexity,
    and efficient algorithms. Artificial Intelligence 149(1), 61–87 (2003)
                                    Mining the Web of Data with Metaqueries

 4. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and
    implication rules for market basket data. In: Peckham, J. (ed.) SIGMOD 1997,
    Proceedings ACM SIGMOD International Conference on Management of Data,
    May 13-15, 1997, Tucson, Arizona, USA. pp. 255–264. ACM Press (1997).
    https://doi.org/10.1145/253260.253325
 5. Colucci, S., Di Noia, T., Di Sciascio, E., Donini, F.M., Ragone, A.: A unified
    framework for non-standard reasoning services in description logics. In: Coelho,
    H., Studer, R., Wooldridge, M. (eds.) ECAI 2010 - 19th European Conference on
    Artificial Intelligence, Lisbon, Portugal, August 16-20, 2010, Proceedings. Frontiers
    in Artificial Intelligence and Applications, vol. 215, pp. 479–484. IOS Press (2010)
 6. De Giacomo, G., Lenzerini, M., Rosati, R.: Higher-order description logics for do-
    main metamodeling. In: Burgard, W., Roth, D. (eds.) Proceedings of the Twenty-
    Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, Cal-
    ifornia, USA, August 7-11, 2011 (2011)
 7. Dehaspe, L., De Raedt, L.: Mining Association Rules in Multiple Relations. In:
    Lavrač, N., Džeroski, S. (eds.) Inductive Logic Programming. Lecture Notes in
    Artificial Intelligence, vol. 1297, pp. 125–132. Springer (1997)
 8. Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in on-
    tological knowledge bases with AMIE+. VLDB Journal 24(6), 707–730 (2015),
    https://doi.org/10.1007/s00778-015-0394-1
 9. Henkin, L.: Completeness in the theory of types. Journal of Symbolic Logic 15(2),
    81–91 (1950)
10. Lisi, F.A.: A declarative modeling language for concept learning in description
    logics. In: Riguzzi, F., Zelezny, F. (eds.) Inductive Logic Programming, 22nd In-
    ternational Conference, ILP 2012, Dubrovnik, Croatia, September 17-19, 2012,
    Revised Selected Papers. Lecture Notes in Computer Science, vol. 7842. Springer
    Berlin Heidelberg (2013)
11. Lisi, F.A.: Towards a metaquery language for mining the web of data. In: Calı̀,
    A., Wood, P.T., Martin, N.J., Poulovassilis, A. (eds.) Data Analytics - 31st British
    International Conference on Databases, BICOD 2017, London, UK, July 10-12,
    2017, Proceedings. Lecture Notes in Computer Science, vol. 10365, pp. 90–93.
    Springer (2017). https://doi.org/10.1007/978-3-319-60795-5 8
12. Lisi, F.A.: A model+solver approach to concept learning. In: Adorni, G., Cagnoni,
    S., Gori, M., Maratea, M. (eds.) AI*IA 2016: Advances in Artificial Intelli-
    gence - XVth International Conference of the Italian Association for Artifi-
    cial Intelligence, Genova, Italy, November 29 - December 1, 2016, Proceedings.
    Lecture Notes in Computer Science, vol. 10037, pp. 266–279. Springer (2016),
    https://doi.org/10.1007/978-3-319-49130-1 20
13. Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine
    learning for knowledge graphs. Proceedings of the IEEE 104(1), 11–33 (2016).
    https://doi.org/10.1109/JPROC.2015.2483592
14. Shen, W., Ong, K., Mitbander, B.G., Zaniolo, C.: Metaqueries for data mining.
    In: Advances in Knowledge Discovery and Data Mining, pp. 375–398. AAAI/MIT
    Press (1996)
15. Tran, H.D., Stepanova, D., Gad-Elrab, M.H., Lisi, F.A., Weikum, G.: To-
    wards nonmonotonic relational learning from knowledge graphs. In: Cussens,
    J., Russo, A. (eds.) Inductive Logic Programming - 26th International Confer-
    ence, ILP 2016, London, UK, September 4-6, 2016, Revised Selected Papers.
    Lecture Notes in Computer Science, vol. 10326, pp. 94–107. Springer (2017).
    https://doi.org/10.1007/978-3-319-63342-8 8