=Paper=
{{Paper
|id=Vol-1810/GraphQ_paper_11
|storemode=property
|title=Recurring Retrieval Needs in Diverse and Dynamic Dataspaces: Issues and Reference Framework
|pdfUrl=https://ceur-ws.org/Vol-1810/GraphQ_paper_11.pdf
|volume=Vol-1810
|authors=Barbara Catania,Francesco De Fino,Giovanna Guerrini
|dblpUrl=https://dblp.org/rec/conf/edbt/CataniaFG17
}}
==Recurring Retrieval Needs in Diverse and Dynamic Dataspaces: Issues and Reference Framework==
<pdf width="1500px">https://ceur-ws.org/Vol-1810/GraphQ_paper_11.pdf</pdf>
<pre>
     Recurring Retrieval Needs in Diverse and Dynamic
      Dataspaces: Issues and Reference Framework

                         Barbara Catania, Francesco De Fino, Giovanna Guerrini
                                                     University of Genoa, Italy
                                         {firstname.lastname}@dibris.unige.it


ABSTRACT                                                            is, a set of smart information services offered by a
Processing information requests over heterogeneous                  municipality to users that, e.g., retrieves information
dataspaces is very challenging because aimed at guar-               about attractions, points of interests, and shops. A
anteeing user satisfaction with respect to conflicting              user might, for instance, ask for the authors of the fig-
requirements on result quality and response time. In                urative artworks she is watching (i.e., they are located
[3], it has been argued that, in dynamic contexts pre-              close to her position), together with information about
venting substantial user involvement in interpreting                other places where such authors are currently expos-
requests, information on similar requests recurring over            ing. In this specific context, data may come from dif-
time can be exploited during query processing.                      ferent datasets, are heterogeneous and very dynamic,
   In this paper, referring to a graph-based modeling               as the user may quickly change her position or the en-
of dataspaces and requests, we propose a preliminary                vironment itself (including data) might be different at
approach in this direction centered on the enabling                 different instants of time.
concept of Profiled Graph Query Pattern (PGQP) as                      Processing complex requests on diverse and dynamic
an aggregation of information on past evaluations of                sources requires: (i) request interpretation; (ii) source
a set of previously executed queries. The information               selection; (iii) actual processing of the request on the
maintained in PGQP is not query results, as in mate-                relevant, and usually big in size, sources. The overall
rialized queries, but can include different kinds of data           process is costly and, nevertheless, it may not guar-
and metadata.                                                       antee user satisfaction on the obtained result since
                                                                    the request (i) could be incorrectly interpreted, (ii)
                                                                    could be processed on inaccurate, incomplete, unre-
1.    INTRODUCTION                                                  liable data, or (iii) could require a processing time
Motivations. The last years have been character-                    inadequate to its urgency.
ized by a tremendous growth of available informa-                      To make processing time more adequate to the ur-
tion, coming from information sources that are highly               gency of the request, approximate query processing
heterogeneous in terms of structure, semantic rich-                 approaches can be considered, by providing a fast an-
ness, and quality. Sources can indeed contain unstruc-              swer at the cost of a lower accuracy [4]. Additionally,
tured data as well as data with well-defined structure,             to improve the interpretation of the request, and re-
strongly correlated and semantically complex but rel-               lated source selection, one possibility is to rely on user
atively static data (e.g., Linked Open Data), highly                involvement. Novel paradigms for exploratory user-
dynamic user-generated data. This information is,                   data interactions, that emphasize user context and
however, a resource currently exploited much below                  interactivity with the goal of facilitating exploration,
its potential because of the difficulties for the users             interpretation, retrieval, and assimilation of informa-
in accessing it. Users would like to find answers to                tion, are being developed. Such solutions, however,
complex requests expressing relationships among the                 do not seem adequate in dynamic contexts, character-
entities of their interest, but they are able to specify            ized by urgent requests and by communication means
requests only vaguely because they cannot reasonably                hampering user interaction.
know the format and structure of the data that encode                  Other innovative approaches should therefore be de-
the information relevant to them.                                   vised, relying on different kinds of information for find-
   As an example, consider a smart city explorer, that              ing possibly approximate answers to complex infor-
                                                                    mation needs, even vaguely and imprecisely specified,
                                                                    operating on the full spectrum of relevant content in
                                                                    a dataspace of highly heterogeneous and poorly con-
                                                                    trolled sources. In [3], to overcome difficulties related
                                                                    to heterogeneity and dynamic nature, the exploitation
                                                                    of additional information, in terms of user profile and
2017, Copyright is with the authors. Published in the Workshop
Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21,       request context, data and processing quality, similar
2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distri-       requests recurring over time, is suggested.
bution of this paper is permitted under the terms of the Creative      In this paper, we focus on similar requests recurring
Commons license CC-by-nc-nd 4.0
over time for improving approximate processing of re-         spond to previously computed results, rather it ranges
quests that we assume already interpreted and rep-            from query results to any other higher level informa-
resented according to a given formalism. Recurring            tion collected during query processing (e.g., query-
retrieval needs are very common in dynamic contexts,          to-data mappings). In specifying the framework, we
such as, for instance, during or after an exceptional         identify the need for PGQP-aware approximate query
event (environmental emergencies or flash mobbing             processing techniques and PGQP management issues
initiatives), in the context of users belonging to the        raised by the framework. Each instantiation of the
same community or that are in the same place, pos-            framework will lead to the definition of a specific PGQP-
sibly at different times. The information needs are           aware query processing approach for a given set of
widespread among different users, because induced by          information associated with PGQPs. Approximated
the event, the interests of the community, and the            results can be generated for two main reasons: due
place, respectively. We aim at taking advantage of the        to the dynamic nature of the dataspace, information
experience gained by prior processing in new searches         associated with PGQPs may change between two dis-
for limiting interpretation errors and response time,         tinct executions of the same query; PGQPs are se-
thus reducing the possibility of producing unsatisfac-        lected based on similarity criteria with respect to the
tory answers.                                                 query at hand.
   Recurring retrieval needs have been considered in             The paper finally proposes a specific instantiation
query processing for a very long time in terms of ma-         of the framework, in the context of which the various
terialized views (see, e.g., [7, 8]) The idea is to precom-   components and concepts are formalized, and algo-
pute the results of some recurring queries as materi-         rithms for the most relevant tasks are specified. In
alized views, select some of such views (view selection       this instantiation, we formally define PGQPs but we
problem) and re-use them (view-based query process-           leave to future work the association of PGQPs with
ing) for the execution of new requests. Unfortunately,        information collected in prior executions.
the usage of materialized views in the contexts de-
scribed above suffers of some problems: (i) views as-         Paper organization. The proposed framework is
sociate a result to a given query, however in the ref-        presented in Section 2 and a specific instantiation of
erence contexts we may be interested in associating           the framework is defined in Section 3. Section 4 con-
other or additional information (e.g., the set of used        cludes by discussing a number of open issues we are
data sources or, under pay-as-you-go integration ap-          investigating in this work-in-progress.
proaches [6], query-to-data mappings); (ii) view up-
dates are very frequent in dynamic environments, re-
ducing the efficiency of the overall system; (iii) view-      2.   PGQP-BASED FRAMEWORK
based query processing techniques usually rely on a              The framework we propose is based on a graph-
precise semantics while heterogeneity tends to favour         based representation of dataspaces, interpreted as a
approximation-based approaches.                               collection of (possibly) schemaless data sources, and of
   Alternative usages of the concept of recurring re-         user retrieval needs, expressed as entity relationships
trieval needs in query processing have been recently          queries over such dataspaces [14]. Specifically, we as-
provided but, also in this case, precise query pro-           sume to deal with data sources represented in terms
cessing is taken into account. For example, in [13],          of partially specified graphs, where some information,
common needs are used with the aim of expediting              associated with nodes, may be unknown [2]. Similarly,
the processing of graph queries against a database of         each request is represented in terms of a graph, speci-
graphs, by reducing the number of subgraph isomor-            fying entities and relationships of interest, which may
phism tests to be performed.                                  contain node variables, in order to characterize infor-
                                                              mation to be retrieved [2]. As an example, Figure 2
Contributions. This paper presents the preliminary            (a) presents a graph query Q corresponding to the in-
results of an on-going work aiming at exploiting in-          formation need “the authors of the figurative artworks
formation about similar requests recurring over time          the user is watching (i.e., they are located close to her
in approximate query processing of requests executed          position), together with such artworks, a biography of
over diverse and dynamic dataspaces. We refer to a            the authors, and information about places in Genoa
graph-based representation of dataspaces, interpreted         where the authors are currently exposing their art-
as a collection of (possibly) schemaless data sources,        works”. Notice that, since the user request is related
and on a graph-based language for modeling user re-           to her current position, it needs to be interpreted and
trieval needs, expressed as entity relationships queries      processed before such position changes.
over such dataspaces. We propose a framework for                 For representing and managing (sets of) recurring
representing and managing (sets of) recurring user re-        user requests, the framework relies on the enabling
quests, and for exploiting them in approximate query          concept of Profiled Graph Query Pattern (PGQP). Each
processing. The proposed framework is based on the            PGQP corresponds to a graph, associated with data or
concept of Profiled Graph Query Pattern (PGQP). Like          metadata related to the past evaluation of the queries
materialized views, each profiled graph query pattern         it represents. Such information does not necessar-
corresponds to a kind of index value associated with          ily correspond to previously computed results, thus,
information related to the past evaluation of the queries     PGQPs do not necessarily correspond to materialized
it represents; however, differently from materialized         views. Rather, any kind of metadata collected dur-
views, such information does not necessarily corre-           ing query processing, as the set of used data sources,
                                        Figure 1: PGQP-based framework


query-to-data mappings under pay-as-you-go integra-        cific PGQP-based processing algorithms should be de-
tion approaches [6], or result quality information, can    signed, depending on the information associated with
be considered. As an example, Figure 2 presents a          PGQPs. On the other hand, query Qsub , which repre-
graph query Q and two PGQPs φ1 and φ2 .                    sents the residual part of Q, has to be executed based
   Independently from the information associated with      on a traditional graph query processing algorithm (see,
PGQPs, the proposed framework aims at addressing           e.g. [17]). At the end of the processing, all partial re-
three main problems which can be stated as follows         sults are merged through the fusion operator θ and
(see Figure 1 for an overall picture).                     the result is returned to the user.
                                                              Referring to Figure 2(a), Qφ1 is processed using in-
PGQP-based query decomposition. Given a query              formation associated with φ1 while Qsub is processed
Q and a set S of PGQPs, defined in terms of graph          by a traditional query processing algorithm.
queries, determine whether it is possible to rely on S
for executing Q. This is possible if Q can be approx-      PGQP set update. For each graph query Q, exe-
imated by a new query, obtained by composing a set         cuted by the system without taking into account
of Q subgraphs, which best match PGQPs, with the           PGQPs, information related to its processing is col-
portion of Q which cannot be represented in terms of       lected (depending on the aim of the approach) and
S. More formally, we look for a subgraph Qsub of Q,        used, together with Q, for updating the set of avail-
a PGQP subset S⊆ ⊆ S, and a fusion operator θ such         able PGQPs at the end of the processing.
that Q can be rewritten in a new approximate query            As an example, Figure 2(d) shows PGQP φ1 ex-
Qa = θ({Qφi |φi ∈ S⊆ }, Qsub ), where Qφi denotes the      tended with the part of the query executed by a tra-
(sub)graph of Q for which an approximate match with        ditional query processing algorithm.
(a subgraph of) φi exists, and Qa satisfies some op-
timality criteria. That is, PGQPs in S⊆ are selected          The framework described above can be instantiated
according to a similarity function δ which quantifies      in several ways. As an example, consider an instan-
how close the query graph and each PGQP are.               tiation targeted to source selection. In this case, the
   With reference to Figure 2, various decompositions      information associated with PGQP φ1 could be the
of Q are possible based on φ1 and φ2 . Assuming φ1         sources exploited during the past processing of the
is more similar to Q than φ2 , the decomposition of Q      queries it represents to retrieve the relevant informa-
can return: (i) S⊆ = {φ1 }; (ii) Qsub defined as shown     tion. Thus, for instance, information relevant for φ1
in Figure 2(a); (iii) θ defined as the join between the    may come from a dynamic data source s1 contain-
partial results of Qφ1 (tuples for variables ?a, ?m, ?b)   ing information about art exhibitions of various artists
and Qsub (tuples for variables ?a,?b). Notice that, by     (left part of φ1 ) and a static data source s2 (such as
interpreting PGQPs as views, traditional view selec-       dbpedia) containing biographies of artists (right part).
tion algorithms would not have selected any PGQP,          An important difference with respect to the material-
due to the approximate matches between “Genoa” and         ized view approach is thus that the selected PGQP is
“GE” and “Figurative” and “Figurative Art”. Note           not associated with query results (that would have re-
that, even if in this example Qφ1 totally matches φ1 ,     quired a frequent recomputation, given the dynamicity
partial matches are also possible.                         of data sources, e.g., in art exhibits) but the sources
                                                           contributing data to the result. PGQP-based process-
PGQP-aware query processing. In order to opti-             ing would rely on this information for targeting to
mize Qa evaluation, we can rely on information asso-       these sources the execution of Qφ1 . As an effect of
ciated with PGQPs in S⊆ . More precisely, each graph       the execution of Q, a third data source s3 containing
query Qφi can be executed taking into account in-          museum catalogues with artwork positions is identi-
formation related to its prior execution, that is, in-     fied, and associated with the updated PGQP (shown
formation associated with φi in S. To this aim, spe-       in Figure 2(d)).
Figure 2: Example PGQP: (a) graph query Q; (b) PGQP φ1 ; (c) PGQP φ2 ; (d) PGQP φ1 after the update phase


3.     A PRELIMINARY PGQP-AWARE                             conjunctive queries and we denote with UACQ the re-
       PROCESSING ALGORITHM                                 sulting language [1, 2]. Formally, an UACQ query Q
                                                            is the union of expressions Q1 , ...Qk where each Qh ,
  In this section, we select suitable models for datas-     1 ≤ h ≤ k, is an expression of the form
paces and user requests, we formalize the concept of                                        ^
PGQP and we discuss their use and management in a                      ans(z1 , ..., zn ) ←     (xi , ai , yi )
preliminar instantiation of the framework.                                              1≤i≤m

3.1      Dataspace and User Requests                        such that the graph underlying Qh is acyclic, m > 0,
Data Space Representation. Under the reference              each xi and yi is a node variable or a constant (1 ≤
context, data spaces may contain redundant or even          i ≤ m), each ai ∈ Σ (1 ≤ i ≤ m), and each zi is some
missing information. Furthermore, the heterogeneous         variable xj or yj (1 ≤ i ≤ n, 1 ≤ j ≤ m). The query
nature of data leads to schemaless representations. To      Q is Boolean if ans(), i.e. n = 0.
take care of these features, we assume the data space         UACQ queries can always be interpreted as graph
is represented in terms of a collection of graph pattern    (pattern) queries Q = (, z) where  is the graph pat-
databases, i.e., graphs where nodes can also be labeled     tern underlying Q and z = (z1 , ..., zn ).
with variables, for representing situations in which the      The semantics of a UACQ query Q = (, z) when ex-
node identity is not known [1]).                            ecuted over a graph pattern database π = (N, E) can
                                                            be defined as the set of answers which are returned
   Definition 1. Let N be a set of node labels. Let         for each instance of the graph pattern database. Ac-
Vnode be a set of node variables. Let Σ be an arbitrary     cording to [2], we call them certain answers and we
(finite or infinite) set of symbols. A Graph Pattern        compute them as:
Database over Σ is a pair π = (N, E) where: (i) N ⊆                certainΣ (Q, π) = {v ∈ N n |π |= [v/z]}
N ∪Vnode is the finite set of nodes; (ii) E ⊆ N ×Σ×N
is the set of edges.                                 2      where, given two graph patterns π1 and π2 , π1 |= π2
                                                            (π1 implies π2 ) if [[π1 ]]Σ ⊆ [[π2 ]]Σ .
  The semantics of a graph pattern database corre-
sponds to all graphs which are instances of the pat-        3.2    Profiled Graph Query Patterns
tern and it is defined via homomorphisms. Given a              A PGQP represents in a compact way a set of graph
graph G = (N 0 , E 0 ) and a graph pattern database         queries executed in the past. In order to provide a for-
π = (N, E), a homomorphism h : π → G is a mapping           mal definition of PGQPs, we propose to consider union
h : N → N 0 that maps labels and variables used in π        as aggregate operator of the component queries and to
to labels used in G such that:                              represent a PGQP as a UACQ query. Since PGQPs
     1. h1 (n) = n, for every node label n ∈ N ; and        are used for the approximate evaluation of queries, the
                                                            issue also arises of determining whether each PGQP
     2. for every edge (n1 , e, n2 ) ∈ E, there is a path   represents, in a reasonably good way, (a portion of) a
        (h(n1 ), e, h(n2 )) in G.                           graph query Q. If this is the case, φ becomes a candi-
                                                            date for the PGQP-aware processing of Q and we say
  The semantics of a graph pattern w.r.t. the label-        that Q matches the PGQP. PGQP candidates can be
ing alphabet Σ is [[π]]Σ = {G over Σ|G |= π}, where         selected through approximate subgraph matching [12]
G |= π holds if a homomorphism h : π → G exists.            and the usage of a graph similarity function.
Graph Query Language. In this work, for the sake               Definition 2. Let Q1 = (N1 , E1 ), Q2 = (N2 , E2 ) ∈
of simplicity and for guaranteeing a tractable com-         U ACQ. An Approximate Subgraph Match between Q1
bined and data complexity (fundamental requirements         and Q2 is a bijection mapping λ : N10 ↔ N20 , where
for dealing with massive in size graph databases), we       Ni0 ⊆ Ni , i = 1, 2. An Approximate Subgraph Match-
consider graph queries specified as unions of acyclic       ing Function is a function m that, for each pair of
Algorithm 1 PGQP-based query decomposition                     Algorithm 2 PGQP set update
INPUT                                                          INPUT
    Q: graph query                                                 Q: graph query
    S = {φ1 , φ2 , ..., φn }: set of PGQPs                         S = {φ1 , φ2 , ..., φn }: set of PGQPs
    δm : a graph similarity function induced by an approxi-        vc : maximum allowed transformation cost
    mate subgraph matching function m                          OUTPUT
    vs : minimum allowed graph similarity                          S: an updated set of PGQPs
OUTPUT                                                         APPROACH
    (S⊆ , Qsub ), such that S⊆ ⊆ S and Qsub is a subgraph of    1: Let φh ∈ S such that CT (Q, φh ) = min1≤i≤n CT (Q, φi )
    Q                                                           2: if (CT (Q, φh ) ≤ vc ) then
APPROACH                                                        3:      φh = T (Q, φh )
 1: Let φk ∈ S such that δm (Q, φk ) = max1≤i≤n δm (Q, φi )     4:      S = S \ {φh } ∪ {φh }
 2: if δm (Q, φk ) > vs then                                    5: else
 3:      return ({Qφk }, Q − Qφk )                              6:      S = S ∪ {Q}
 4: else                                                        7: return S
 5:      return ({}, Q)


                                                               Qφ and then all remaining non connected nodes be-
Q1 , Q2 ∈ U ACQ, returns an approximate subgraph               longing to Qφ ). This consideration guides us in the
match from Q1 to Q2 . A Graph Similarity Function              definition of a preliminary decomposition algorithm
is a metric function δ : U ACQ × U ACQ → [0, 1] such           (see Algorithm 1). Given a query Q and a set of
that, for each Q1 , Q2 ∈ U ACQ, δ(Q1 , Q2 ) quantifies         PGQPs S, the PGQP φk providing the highest simi-
the similarity between Q1 and Q2 . δ is induced by an          larity value with respect to Q is selected and, if such
Approximate Subgraph Matching Function m (denoted              value is greater than a given threshold (thus, Q and φk
by δm ) if the value assigned to (Q1 , Q2 ) quantifies the     are close enough), the binary decomposition described
similarity between subgraphs Q01 and Q02 involved in           above is returned. Otherwise, no decomposition is ap-
the match m(Q1 , Q2 ).                                  2      plied.
  PGQPs and PGQP matches can now be defined.                   3.4    PGQP Set Update
                                                                 In order to update the PGQP store, we rely on a
   Definition 3. A Profiled Graph Query Pattern φ              graph transformation function.
is a graph query in UACQ. Let δm be a graph sim-
ilarity function induced by an approximate subgraph               Definition 4. A Graph Transformation Function
matching function m. A graph query Q matches a                 is a function T : U ACQ × U ACQ → U ACQ such
PGQP φ, with similarity v ∈ (0, 1], if δm (Q, φ) = v.          that, for each query Q and PGQP φ, Q is a match of
Qφ denotes the (sub)graph of Q involved in the match.          T (Q, φ), based on δm . 2
Q is a total match of φ if Qφ = Q, it is a partial match
otherwise.                                             2          Transformation functions can be defined in terms
                                                               of a small set of operations, previously defined in the
  In the following, we assume that function δm is pro-         context of the so called tree editing problem [16], which
vided by the system and used during PGQP-based                 at least contains node relabelling, deletion and inser-
query decomposition and PGQP set update.                       tion. Each transformation function can be defined
                                                               by specifying which operations should be applied to
3.3    PGQP-based Query Decomposition                          a given PGQP for each node and edge of the query
   In the following, we present a preliminary PGQP-            Q at hand, in such a way that Q, after the trans-
based query decomposition approach, defined assum-             formation, matches the resulting PGQP. An example
ing to: (i) select only one PGQP (thus, |S⊆ | = 1); (ii)       of graph transformation is the function that, given a
merge partial results thought relational operators.            PGQP φ and a query Q, inserts into φ all nodes and
   Several approximate graph matching functions have           edges in Q − Qφ that have not been mapped into φ
been proposed so far which can be used as basis for            nodes by δm .
defining similarity functions (see, e.g., [9, 10] for some        In order to identify the best transformation, a graph
recent papers and [15] for a survey). A relevant exam-         cost function can be defined as follows.
ple, due to its flexibility, is represented by the Similar-
ity Flooding function [11]. Based on the intuition that           Definition 5. A Graph Cost Function for a trans-
similar nodes have similar neighbors, node similar-            formation function T is a function CT : U ACQ ×
ity values are computed relying on information about           U ACQ → R, which, for each graph query Q and PGQP
node neighbors through an iterative fixpoint compu-            φ, assigns a cost value to T (Q, φ).             2
tation. The selection metric, used in Selection Flood-
ing to identify the best mapping, can be used (after             Similarly to the tree editing problem, a graph cost
normalization of values between 0 and 1) as a graph            function quantifies the transformation effort by assign-
similarity function. Details about PGQP definition in          ing a cost to each applied editing operation. As an ex-
terms of the Similarity Flooding function can be found         ample, given a graph query Q and a PGQP φ, cost 1
in [5].                                                        can be assigned to the insertion of each new node and
   According to Definition 3, each PGQP φ divides a            new edge in φ; a cost proportional to their distance
graph query Q into two subqueries, namely Qφ and               can be assigned to each node n in Q which is mapped
Q − Qφ (computed by removing from Q first edges in             into a node v 0 in φ by δm .
   Algorithm 2 presents a simple approach for PGQP          current and should be indeed used for updating the
set update based on the notions introduced above. In        PGQP set? What about the tradeoff between the size
particular, it updates the input set of PGQPs by first      of the PGQP set, the query processing cost, and the
selecting the PGQP leading to the lowest transforma-        effectiveness (in terms of result quality) of the overall
tion cost for the input graph query Q. If such cost is      query approach?
lower than a given threshold, the PGQP set is updated
by replacing the selected PGQP with the PGQP ob-            5.   REFERENCES
tained by the transformation. Otherwise, the PGQP
                                                             [1] P. Barceló. Querying graph databases. In Proc.
set is updated by inserting Q as a new PGQP.
                                                                 of PODS, pp. 175–188. ACM, 2013.
   Example 1. Consider the graph query Q and the             [2] P. Barceló, L. Libkin, and J. L. Reutter.
PGQP set S = {φ1 , φ2 } presented in Figure 2. Con-              Querying graph patterns. In Proc. of PODS, pp.
sider the graph similarity function δm induced by the            199–210. ACM, 2011.
Similarity Flooding algorithm [11] (see [5] for details),    [3] B. Catania et Al. Wearable queries: adapting
a minimum similarity threshold vs = 0.7, and a max-              common retrieval needs to data and users. In
imum cost threshold vc = 3.                                      Proc. DBRank Workshop. ACM, 2013.
   Q is processed as follows. Algorithm 1 first com-         [4] B. Catania and G Guerrini. Approximate
putes the mapping and the similarity between Q and               queries with adaptive processing. Advanced
each φi , relying on δm . Partial approximate subgraph           Query Processing, Intelligent Systems Reference
matches can be generated from Q to both φ1 and φ2 .              Library (36), pp. 237-269. Springer, 2013.
Assume the following graph similarities are computed:        [5] B. Catania, F. De Fino, and G. Guerrini.
δm (Q, φ1 ) = 0.83 and δm (Q, φ2 ) = 0.40. PGQP φ1 is            Recurring retrieval needs in diverse and
selected because it is the most similar to Q. Since sim-         dynamic dataspaces: issues and reference
ilarity is greater than vs , decomposition is applied and        framework, extended version. Technical Report,
the pair ({Qφ1 }, Q − Qφ1 ) is returned for processing.          University of Genoa, Italy, 2016.
   At the end of the processing, Algorithm 2 is used         [6] A. Das Sarma, X. Dong, and A. Halevy.
for updating the PGQP store. Suppose that CT (Q −                Bootstrapping Pay-As-You-Go Data integration
Qφ1 , φ1 ) = 2.56 and CT (Q − Qφ2 , φ2 ) = 3. PGQP φ1            systems.In Proc. of SIGMOD. ACM, 2008.
is selected because of the lowest transformation cost.       [7] W. Fan, X. Wang, and Y. Wu. Answering
Since the transformation cost is lower than vc , PGQP            pattern queries using views. IEEE Trans.
φ1 is updated by inserting all nodes and edges in Q −            Knowl. Data Eng., 28(2):326–341, 2016.
Qφ1 not mapped into φ1 nodes by δm .                    3    [8] F. Goasdoué et Al. View selection in semantic
                                                                 web databases. PVLDB, 5(2): 97–108, 2011.
4.   OPEN ISSUES                                             [9] J. He et Al. Assessing single-pair similarity over
   The work reported in this paper represents just the           graphs by aggregating first-meeting
first step towards the design of approximate query pro-          probabilities. Inf. Syst. 42: 107-122, 2014.
cessing approaches for recurring retrieval needs. Sev-      [10] D. Koutra et Al. DeltaCon: Principled
eral issues still need to be investigated. Some of them          Massive-Graph Similarity Function with
are discussed in what follows.                                   Attribution. TKDD 10(3): 28, 2016
Framework instantiation. In this paper, we consid-          [11] S. Melnik, H. Garcia-Molina, and E. Rahm.
ered a simple notion of approximate subgraph match-              Similarity flooding: A versatile graph matching
ing, defined by relaxing the notion of subgraph iso-             algorithm and its application to schema
morphism. Other approaches can however be con-                   matching. In Proc. of ICDE, pp. 117–128. IEEE,
sidered, which allow more flexible types of approxi-             2002.
mation (see, e.g., [15] for a survey). More effective       [12] Y. Tian and J.M. Patel. TALE: A Tool for
query decomposition approaches should also be de-                Approximate Large Graph Matching. In Proc. of
signed which increase the number of generated sub-               ICDE, pp. 963-972, 2008
queries, together with specific optimality criterias; op-   [13] J. Wang, N. Ntarmos, and P. Triantafillou.
timization approaches should be developed which take             Indexing query graphs to speedup graph query
into account the number and the shape of the identi-             processing. In Proc. of EDBT, pages 41–52.
fied subqueries for selecting the most efficient PGQP-           2016.
based execution plan.
                                                            [14] G. Weikum. Data and knowledge discovery.
Information associated with PGQPs. As already
                                                                 GRDI 2020, 2011.
discussed, several types of information can be associ-
                                                            [15] Y. Wu. Extending graph homomorphism and
ated with PGQPs. For each type of information, spe-
                                                                 simulation for real life graph matching. PhD
cific PGQP-based query processing algorithms should
                                                                 Thesis, 2010.
be devised.
PGQP management. In this paper, we assumed to               [16] K. Zhang and D. Shasha. Simple fast algorithms
deal with an initial set of PGQPs and we only prelim-            for the editing distance between trees and
inarily discuss how to modify it taking into account a           related problems. SIAM Journal on Computing,
new graph query execution. There are however sev-                18(6):1245–1262, 1989.
eral further issues to be taken into account for PGQP       [17] P. Zhao, J. Han. On graph query optimization
management, e.g., how to decide that a query is re-              in large networks. PVLDB, 3(1): 340-351, 2010.

</pre>