<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring RDF Graphs through Summarization and Analytic Query Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioana Manolescu</string-name>
          <email>ioana.manolescu@inria.fr</email>
          <aff>Inria, Institut Polytechnique de Paris, Palaiseau, France</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Graph data is central to many applications, ranging from social networks to scientific databases. Graph formats maximize the flexibility offered to data designers, as they are mostly schemaless and can thus capture content with very heterogeneous structure. RDF, the W3C's format for sharing open (linked) data, adds the possibility to attach semantics to data, describing application-domain constraints by means of ontologies; in turn, this leads to implicit data that is also part of a graph even if it is not explicitly present in it. In this paper, we present a structured walk through the problem of analyzing and exploring RDF graphs by finding groups of structurally similar nodes, and by automatically identifying interesting aggregates therein. We outline the challenges raised by such processing in large, complex RDF graphs, present the basic principles behind existing solutions, and highlight opportunities for future research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Graph data is increasingly popular, thanks to the flexibility it
affords its designers: it enables representing varying-structure
entities together with their rich attributes and the relationships
interconnecting them.</p>
      <p>In particular, RDF graphs are abundantly present on today’s
Web, as RDF is the recommended format for sharing Open Data.
The Linked Open Data Cloud Web site (https://lod-cloud.net/)
lists numerous examples of RDF databases. Nevertheless, the
multiplication of data sources is not sufficient to enable the
construction of applications that take advantage of them. An important
obstacle is rooted in the very advantages of RDF: its flexibility
and the heterogeneity it tolerates in the data make it hard for
users to understand what a graph is about, and potentially even
harder to detect what is interesting within the graph.</p>
      <p>There are two main approaches to analyzing and exploring a
graph’s content. On one hand, node-focused exploration could,
for instance, allow users to identify a few nodes and/or edges they
are interested in. This could be achieved by allowing them to
search, e.g., through keywords, or by some statistical analysis, e.g.,
identifying nodes that are somehow outliers, through their
content or through their structural properties. Such fine-granularity
exploration enables gaining detailed knowledge about a relatively
small part of the graph. On the other hand, group- or class-focused
exploration seeks to identify interesting subgraphs, or (most
typically) groups of nodes, which are in a certain sense similar or
comparable. The first step is thus to simplify the cognitive task
of getting acquainted with a graph, by reducing it to the
(simpler) task of understanding a smaller, abstract version thereof,
where each group of nodes represents a “class” or “meta-node”.</p>
      <p>Such a broad graph analysis may hide (or obscure within a larger
group) interesting values or outliers, but it has the advantage of
enabling a global, top-down view, which can be gained as one
starts working with the graph.</p>
      <p>The research highlighted below takes this second path. The
problems to be solved are: how to efficiently build meaningful
summaries of large RDF graphs (Section 2); and how to analyze
and explore RDF graphs by means of aggregate queries
(Section 3). Each problem raises specific conceptual and algorithmic
challenges; we motivate the solutions we found, and point to
interesting areas where the work could continue.</p>
      <p>
        Are node groups an interesting metaphor for exploring RDF
graphs? Figure 1 (from [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) suggests so. It depicts the
properties of the subjects in a graph describing publications listed
in the DBLP server. Each ring represents the frequency of a given
RDF property among these subjects (or resources). Thus, the
central blue ring reflects the property rdf:type, which clearly
all the subjects have; the second one, dark blue, is date, which
publications have, but authors do not; we can see a set of other
properties present on almost all publications (their frequency
diminishing as we move away from the center of the graph),
while another set of resources have the name property but none
of the properties that publications have; these are the authors.
      </p>
    </sec>
    <sec id="sec-3">
      <title>SUMMARIZING RDF GRAPHS THROUGH STRUCTURAL QUOTIENTS</title>
      <p>
        The problem of summarizing RDF graphs has been extensively
studied, in particular drawing upon ideas and solutions proposed
for summarizing generic graphs, or XML documents; RDF
summarization approaches are surveyed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A brand of summaries
well-established in database research is that of structural
quotients: an equivalence relation is identified between the nodes
of a graph, typically based on their incoming/outgoing edges.
A quotient summary has one node per equivalence class,
and one edge between two summary nodes if and only if a
corresponding edge connects, in the original graph, a pair of nodes
they represent. Quotient summaries have been introduced as a basis
for structural indexing, in OEM databases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and subsequently
for XML, e.g. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
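      <p>To make the definition concrete, the following minimal Python sketch builds a quotient summary from a list of triples, assuming the equivalence relation is supplied as a function from nodes to class identifiers. The function name quotient_summary, the toy triples and the hand-written class assignment are illustrative assumptions, not taken from the cited systems.</p>
      <preformat><![CDATA[
```python
def quotient_summary(triples, class_of):
    """Build a quotient summary of an edge-labeled graph.

    triples  : iterable of (subject, property, object) edges
    class_of : function mapping a node to its equivalence-class id
    Returns (summary_nodes, summary_edges).
    """
    summary_nodes = set()
    summary_edges = set()
    for s, p, o in triples:
        cs, co = class_of(s), class_of(o)
        summary_nodes.update((cs, co))
        # A summary edge exists iff some original edge connects
        # a pair of nodes represented by the two summary nodes.
        summary_edges.add((cs, p, co))
    return summary_nodes, summary_edges


# Toy graph (hypothetical data): two papers and two authors.
triples = [
    ("paper1", "author", "alice"),
    ("paper2", "author", "bob"),
    ("paper1", "cites", "paper2"),
]

# Hypothetical, hand-written equivalence relation, for illustration only.
classes = {"paper1": "Paper", "paper2": "Paper",
           "alice": "Author", "bob": "Author"}

nodes, edges = quotient_summary(triples, classes.get)
print(sorted(nodes))
print(sorted(edges))
```
]]></preformat>
      <p>Any equivalence relation can be plugged in through class_of; the choice of relation is what distinguishes one quotient summary from another.</p>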
      <p>
        The choice of an equivalence relation, thus, fully determines
a summary. Which equivalence relation to pick? In [
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ], we
have proposed two novel relations, based on the transitive closure
of sharing incoming, respectively, outgoing properties. Thus, if
n1, n2 both have titles, n2 and n3 both have authors, while n3, n4
both have publication years, we say n1 to n4 share the same
outgoing property clique, comprising the properties (edge labels) “title”,
“author”, and “year”. This outgoing clique (set of properties) is
defined based on their (transitively) co-occurring on common nodes.
Observe that n1 and n4 may be quite different from each other;
in particular, they may have no property in common. Incoming
property cliques are symmetrically defined. Based on the notions of
incoming, respectively, outgoing property cliques, we introduce
two notions of equivalence, so-called weak and strong [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and
show that they lead to very compact summaries, even for RDF graphs on
which previously proposed quotients yield summaries with orders of
magnitude more nodes. Figure 2 illustrates this: the
summary of a BSBM graph of 100 million triples has only 5 nodes
and 11 edges, the size of a relatively simple Entity-Relationship
diagram.
      </p>
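      <p>The n1 to n4 example can be replayed with a small union-find sketch in Python. It covers only the outgoing property cliques; it is not the full weak or strong equivalence, whose definitions also involve incoming cliques. The function name and the toy triples are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def outgoing_property_cliques(triples):
    """Map each subject to its outgoing property clique.

    Two properties belong to the same clique if they co-occur,
    directly or transitively, on common subjects.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    first_prop = {}  # subject -> first property seen on that subject
    for s, p, _ in triples:
        find(p)  # register the property
        if s in first_prop:
            union(first_prop[s], p)
        else:
            first_prop[s] = p

    cliques = defaultdict(set)
    for p in list(parent):
        cliques[find(p)].add(p)
    return {s: frozenset(cliques[find(p)]) for s, p in first_prop.items()}


# The example from the text; the object values are made up.
triples = [
    ("n1", "title", "t1"),
    ("n2", "title", "t2"), ("n2", "author", "a1"),
    ("n3", "author", "a2"), ("n3", "year", "2019"),
    ("n4", "year", "2020"),
]

node_clique = outgoing_property_cliques(triples)
print(node_clique["n1"] == node_clique["n4"])
```
]]></preformat>
      <p>Although n1 only has a title and n4 only has a year, both end up with the same clique, which is why grouping by cliques tolerates heterogeneous nodes.</p>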
      <p>What sets the summarization of RDF graphs apart from related
graph summarization problems? Several features concur.</p>
      <p>First, RDF nodes may have types; this is encoded by graph
edges, connecting the typed nodes to special kinds of resources
in the graph, namely the type nodes themselves. A node may
have zero, one, or more types, which may or may not be
logically connected. Further, some nodes in a graph may have types,
while others lack them. Types complicate summarization, since
on one hand, they encapsulate precious application knowledge
when present, but on the other hand, summarization must be
able to make sense of a graph even in their absence. Therefore,
we have distinguished data-first summarization, which groups
nodes according to their data properties first and foremost, and then
carries the types of a node to its representative in the summary. This is
suited for graphs where types are mostly absent, or not sufficient
to distinguish classes of nodes from each other. The opposite
strategy is type-first; it groups nodes by their types, and only
uses property cliques to differentiate between the untyped ones.
Depending on the graph, data-first or type-first summarization
may be more suitable for producing summaries that are easy to
understand.</p>
      <p>
        A second, more subtle aspect is due to the presence of an
ontology, which may make part of the graph implicit, that is,
triples may hold in the graph, which are not explicitly present
there. In this case, summarizing the graph of explicit triples may
not account for the implicit ones. We have proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] a
sufficient condition under which one can compute the summary
of a saturated graph (including all its implicit and explicit data),
without actually saturating the graph; we also show that our
Weak and Strong summaries, in their data-first incarnation,
satisfy this condition, whereas any type-first summarization does
not.
      </p>
      <p>
        The summaries we devised, like many others, strive to separate
nodes of a large graph in groups that simplify its understanding.
Compared with other works, our goal has been to let users grasp,
at first sight, the major groups of nodes in a graph. We
made the hypothesis that accepting “transitively similar” nodes
in the same group allows identifying such groups; our experiments
bear out this claim. Another strong advantage of our summaries
is that they can all be built in time linear in the size of the
input [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], including in incremental mode, that is, deriving the
summary equivalence relation and summarizing the graph at the
same time.
      </p>
      <p>The compactness of our summaries comes at the cost of precision.
For instance, they provide very poor support for indexing, since
they cannot guarantee that all graph nodes represented by a
certain summary node have a certain property. More generally,
they (like any other quotient summaries) reflect the structure,
but not the values (leaf nodes) present in the graph.</p>
    </sec>
    <sec id="sec-5">
      <title>EXPLORING RDF GRAPHS BY MEANS OF INTERESTING AGGREGATES</title>
      <p>While aggregation is well-established as a way to analyze
and summarize relational data, the very meaning of
aggregation has been slow in coming for graphs, and in particular for
RDF. In March 2013, the SPARQL 1.1 specification introduced a
Group-By primitive together with aggregation operators; their
semantics is essentially lifted from the relational database world,
and applied to the tuples of bindings resulting from the matches
of a SPARQL Where block. Below we outline the path we followed,
starting from devising an RDF counterpart to relational (data warehouse)
analytical queries, that is, formalizing RDF analytical (aggregate) queries
(Section 3.1), and continuing (in subsequent, currently ongoing work) with
exploring RDF graphs by automatically identifying interesting
aggregates (Section 3.2).</p>
    </sec>
    <sec id="sec-6">
      <title>RDF aggregate queries</title>
      <p>
        Our research [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ] considered, at about the same time, RDF
aggregation at a conceptual level: what should an RDF analytical query
look like? The well-known concepts of fact, dimension and
measure from the relational literature hardly fit. To start with,
RDF graphs lack a previously-defined schema, and thus the facts
at the heart of analytical processing are not defined; irregularity
in the data may lead to a dimension or measure being absent, or
being multiply defined. We proposed to define RDF analytical
(aggregate) queries as a combination of: a fact query, defining the
set of resources to be treated as facts and analyzed together; a
set of dimension queries, associating to each fact zero or more
values for each dimension; a measure query, specifying what
to use as a measurable property of each fact; and finally an aggregation
function among the usual ones (sum, max, average etc.). A sample
aggregate query can be composed as follows:
• Facts are all the articles published between 2000 and 2020;
• A dimension is the country of an author’s institution;
many papers have authors from multiple countries,
naturally leading to multiple values for this dimension;
• Another dimension is the year;
• A measure of a paper is a keyword in the paper abstract;
• The aggregation function counts the distinct keywords.
The analytical query described above groups papers by year
and author country, and for each paper group, it counts the
keywords associated with papers published in that year with an
author from that country.
      </p>
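      <p>The sample query above can be mimicked on toy data with the Python sketch below. It is only an illustration of the fact, dimension, measure and aggregation roles under simplifying assumptions: the data is held in an in-memory dictionary rather than an RDF store, and all names (papers, fact_query, analytical_query) are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import product

# Hypothetical toy data: each paper maps a property to a list of values,
# since RDF properties may be absent or multi-valued.
papers = {
    "p1": {"year": [2019], "country": ["FR", "IT"], "keyword": ["rdf", "olap"]},
    "p2": {"year": [2019], "country": ["FR"], "keyword": ["rdf"]},
    "p3": {"year": [2020], "country": [], "keyword": ["graphs"]},   # no country
    "p4": {"year": [1995], "country": ["US"], "keyword": ["xml"]},  # too old
}


def fact_query(data):
    """Facts: the papers published between 2000 and 2020."""
    return [f for f, props in data.items()
            if any(2000 <= y <= 2020 for y in props.get("year", []))]


def analytical_query(data, facts, dimensions, measure):
    """Count the distinct measure values in each cell of the result."""
    cells = defaultdict(set)
    for f in facts:
        dim_values = [data[f].get(d, []) for d in dimensions]
        # product() yields nothing if some dimension has no value, so a fact
        # lacking a dimension contributes to no cell; a fact with several
        # values for a dimension contributes to several cells.
        for cell in product(*dim_values):
            cells[cell].update(data[f].get(measure, []))
    return {cell: len(vals) for cell, vals in cells.items() if vals}


result = analytical_query(papers, fact_query(papers),
                          ["year", "country"], "keyword")
print(result)
```
]]></preformat>
      <p>Here p1 contributes to two cells because it has two countries, while p3 contributes to none because it has no country.</p>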
      <p>As this example shows, a fact contributes to the answer of
an RDF analytical query if it has values for all dimensions and
for the measure; a fact may contribute to several cells, if it has
multiple values for one or several dimensions. This flexible model
has numerous advantages for analyzing RDF graphs:
• There may be several fact sets in an RDF graph. One could, for
instance, in a publication dataset, consider the articles to be the
facts, and aggregate them according to their topics, their year
of publication etc.; conversely - or rather, at the same time
- one could consider the authors to be the facts, and articles (or
the articles’ years, or topics, or venues) as dimensions.
• As explained above, it flexibly accommodates the absence of
a dimension or a measure, as well as their possible multiple
values.</p>
      <p>
        The formal semantics of such analytical queries [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ] is
compatible with the SPARQL 1.1 aggregation semantics; the latter,
however, is only concerned with the syntactic level, not with the
more conceptual one where facts, dimensions and measures are
specified.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Automatically identifying interesting RDF aggregates</title>
      <p>As previously explained, RDF analytical queries can
express a large set of questions that characterize, in a
flexible manner, the nodes of an RDF graph. But which queries
should one ask?</p>
      <p>
        A well-explored branch of research in relational data analytics
concerns the automated identification of interesting analytical
queries [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17–19</xref>
        ]. These works are placed in a typical relational
data warehouse scenario, where a large number of dimensions
exist, and seek to automatically propose to the users the
analytical queries that are likely to bring them most insight. For
instance, in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a query is interesting (brings a useful insight) if
it exhibits, on a subset of the facts, a trend that is different from
the one that holds on the complete fact set.
      </p>
      <p>
        In our Dagger project [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we initiated an approach to
automatically identify interesting analytical queries in RDF graphs.
This was based on a set of simple choices:
• Choosing as facts all nodes of a given RDF type, or, alternatively,
asking users to specify the fact query;
• Choosing as dimensions the properties that sufficiently many
facts have, and whose number of distinct values does not
exceed a certain threshold; we also introduced derived properties,
such as the number of authors that a paper has, which we treat
like a new property attached to the paper fact;
• Choosing a measure among the other (original or derived)
properties of the facts;
• Considering an aggregate interesting if it maximizes a certain
statistical measure of the aggregate query result.
      </p>
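      <p>A minimal Python sketch of these choices, under simplifying assumptions, could look as follows. It is not the Dagger implementation: properties are taken to be single-valued, the aggregation function is fixed to the average, and the statistic is the population variance of the aggregated values (variance being the interestingness measure named in the conclusion of this paper). The function names, thresholds and toy facts are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from statistics import pvariance


def candidate_dimensions(facts, min_support, max_distinct):
    """Properties held by at least a min_support fraction of the facts
    and having at most max_distinct distinct values."""
    holders = defaultdict(int)
    values = defaultdict(set)
    for props in facts.values():
        for p, v in props.items():
            holders[p] += 1
            values[p].add(v)
    n = len(facts)
    return sorted(p for p in holders
                  if holders[p] / n >= min_support
                  and len(values[p]) <= max_distinct)


def interestingness(facts, dimension, measure):
    """Variance of the per-group averages of `measure`, grouping the
    facts by `dimension` (assumption: average as aggregation function)."""
    groups = defaultdict(list)
    for props in facts.values():
        if dimension in props and measure in props:
            groups[props[dimension]].append(props[measure])
    averages = [sum(g) / len(g) for g in groups.values()]
    return pvariance(averages) if len(averages) > 1 else 0.0


# Hypothetical facts; n_authors plays the role of a derived property.
facts = {
    "p1": {"title": "t1", "year": 1990, "venue": "A", "n_authors": 1},
    "p2": {"title": "t2", "year": 1990, "venue": "B", "n_authors": 1},
    "p3": {"title": "t3", "year": 2005, "venue": "A", "n_authors": 3},
    "p4": {"title": "t4", "year": 2005, "venue": "B", "n_authors": 3},
}

measure = "n_authors"
candidates = candidate_dimensions(facts, min_support=0.75, max_distinct=2)
ranked = sorted(((interestingness(facts, d, measure), d)
                 for d in candidates if d != measure), reverse=True)
print(ranked)
```
]]></preformat>
      <p>On this toy input, title is rejected as a dimension because it has too many distinct values, and grouping by year ranks above grouping by venue, since only the former separates papers with few authors from papers with many.</p>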
      <p>Figure 3 illustrates the kinds of aggregates Dagger identified,
in a set of DBLP publications from 1936 to 2006. At the top, the
average number of authors of a published paper; we see the
rise of co-authorship along the years. At the center, the number
of published papers grouped by year; this graph really gives
flesh to the concern that as an academic community we may be
publishing too much! Last but not least, the graph at the bottom
counts the books listed in DBLP, grouped by their publisher.
The dominating bar corresponds to Springer; Infix Verlag comes
second, and a set of bars at the left of the graph show different</p>
    </sec>
    <sec id="sec-9">
      <title>Scaling up the exploration of interesting aggregates</title>
      <p>
        Dagger identifies interesting aggregates through exhaustive
search: it explores and evaluates aggregates subject to a given
time limit, before returning the most interesting ones. This made
its exploration process lengthy. We explored in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] the use of
sampling, both to select the dimensions and measures to use for
the facts, and to decide which aggregates are interesting. While
this did reduce the running time, it provided no guarantee of the
accuracy of the exploration thus abridged.
      </p>
      <p>
        In the domain of relational data analytical processing, a key
ingredient to the automated selection of interesting queries is
the ability to explore many candidates, and discard as early as
possible those queries which can quickly be determined
to be insufficiently interesting. Online aggregation, pioneered
in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] has been a crucial ingredient here: it allows deriving,
while aggregate queries are being computed, an approximation
(with a given confidence interval) of these queries’ results. We
have explored that path in Spade [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], our follow-up project on
Dagger, where we make several new contributions:
• We enlarge the exploration space to multidimensional (not just
mono-dimensional) aggregates;
• We introduce more derived properties, for instance by means of
topic extractions from text; also, moving toward the generality
of the analytical queries introduced in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] we allow dimensions
and measures to be defined by paths of a certain length starting
in the facts;
• To cope with the expensive exploration of multidimensional
aggregates while remaining efficient, we have devised a novel
version of a well-known algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], capable of
evaluating in a single pass all the aggregates determined by a set of
dimensions, a measure and an aggregation function;
• Still toward the goal of scalability, we have devised novel
early-stop techniques, capable of estimating the interestingness of an
aggregation query while it is computed, and of stopping the
computation as soon as it becomes clear that other aggregates, whose
computation is ongoing, are more interesting.
      </p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION AND OUTLOOK</title>
      <p>
        The field of graph analytics is by nature very broad, given the
extreme diversity of data modeled as graphs. This paper
summarizes a set of recent works carried out with the global goal of helping
users grasp the content of a large and potentially complex RDF
graph. Our key findings can be summarized as follows:
• Identifying interesting node groups is an intuitive first step
toward gaining an understanding of the graph, of its semantics,
structure and content.
• The incoming and outgoing properties of RDF resources can be
used as a good basis for identifying such groups, provided that
measures are taken to avoid the extreme fragmentation
which would result from requiring all nodes in a group to
have exactly the same structure. Instead, summaries such as
those we introduced in [
        <xref ref-type="bibr" rid="ref10 ref4 ref9">4, 9, 10</xref>
        ] accept some heterogeneity among
the nodes, which generally leads to easy-to-read summaries.
• If one also takes into account the values, which structural
summaries completely disregard, there are many ways to explore
how groups of nodes in RDF graphs compare among
themselves, and countless combinations of facts, dimensions, and
measures one could use. In Dagger [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and its successor
Spade [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we are working to identify interesting aggregates as quickly
as possible, with interestingness currently
defined as the variance of the set of values that are part of the
aggregate query result.
      </p>
      <p>
        Many avenues for future research are open.
• Personalization, user input, or query by example [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] could be
blended with exploration such as we envisioned it, in order
to help users get as soon as possible to the information they
need for a specific task, in the spirit of [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
• RDF graph semantics has not yet been fully taken into account
in the exploration. It could be incorporated as a facet, or as a
way of navigating from one interesting insight to another one,
on a closely (semantically) related set of items.
      </p>
      <p>Figure caption (Spade): number of DBLP articles by year and keyword appearing in
their titles. The darker the color, the fewer articles there are.</p>
      <p>A related, if more mundane, question is which platform (or
back-end) should best support such analytics; the competition
among (RDF) graph processing platforms is currently hot, with
no clear winner in sight. While many contenders exist, the very
different kinds of processing envisioned, say, in Semantic Web
integration queries, on one hand, and in social network analysis
with the goal of influence maximization, on the other hand, make
comparisons difficult, and convergence unlikely.</p>
      <p>
        Going beyond exploration of RDF graphs, one could envision
tools blending more strongly extraction of information from
unstructured content, and structured data under one of its many
forms. Such graphs are encountered, for instance, when
integrating heterogeneous data sources such as those available
to journalists. We have outlined such a graph-based integration
framework in the ConnectionLens [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] system. Such
heterogeneous graphs exhibit even more structural and content
heterogeneity; higher-level abstraction methods are needed as a
first step towards facilitating their understanding [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We plan
to continue work on these topics, within the ANR SourcesSay
project (2020-2024).
      </p>
      <p>Acknowledgments. This research has been partially funded by
ANR-16-CE23-0010-01 and the H2020 research program under
grant agreement nr. 800192.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Elham</given-names>
            <surname>Akbari-Azirani</surname>
          </string-name>
          , François Goasdoué, Ioana Manolescu, and
          <string-name>
            <given-names>Alexandra</given-names>
            <surname>Roatis</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Efficient OLAP Operations For RDF Analytics</article-title>
          .
          <source>In International Workshop on Data Engineering meets the Semantic Web (DESWeb)</source>
          . Seoul, South Korea. https://doi.org/10.1109/ICDEW.2015.7129548
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Irène</given-names>
            <surname>Burger</surname>
          </string-name>
          , Ioana Manolescu, Emmanuel Pietriga, and
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Suchanek</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Toward Visual Interactive Exploration of Heterogeneous Graphs</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Sejla</given-names>
            <surname>Cebiric</surname>
          </string-name>
          , François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Ioana Manolescu, Georgia Troullinou, and
          <string-name>
            <given-names>Mussab</given-names>
            <surname>Zneika</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Summarizing Semantic Graphs: A Survey</article-title>
          .
          <source>The VLDB Journal 28</source>
          ,
          <issue>3</issue>
          (
          <year>June 2019</year>
          ). https://hal.inria.fr/hal-01925496
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Šejla</given-names>
            <surname>Čebirić</surname>
          </string-name>
          , François Goasdoué, and
          <string-name>
            <given-names>Ioana</given-names>
            <surname>Manolescu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Query-Oriented Summarization of RDF Graphs</article-title>
          .
          <source>In Proceedings of the VLDB Endowment</source>
          , Vol.
          <volume>8</volume>
          .
          Kohala Coast, Hawaii, United States. https://hal.inria.fr/hal-01178140
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Camille</given-names>
            <surname>Chanial</surname>
          </string-name>
          , Rédouane Dziri, Helena Galhardas, Julien Leblay, MinhHuong Le Nguyen, and
          <string-name>
            <given-names>Ioana</given-names>
            <surname>Manolescu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>ConnectionLens: Finding Connections Across Heterogeneous Data Sources</article-title>
          .
          <source>Proceedings of the VLDB Endowment (PVLDB) 11</source>
          (
          <year>2018</year>
          ),
          <article-title>4</article-title>
          . https://doi.org/10.14778/3229863.3236252
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dario</given-names>
            <surname>Colazzo</surname>
          </string-name>
          , François Goasdoué, Ioana Manolescu, and
          <string-name>
            <given-names>Alexandra</given-names>
            <surname>Roatis</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>RDF Analytics: Lenses over Semantic Graphs</article-title>
          .
          <source>In 23rd International World Wide Web Conference</source>
          . Seoul, South Korea. https://doi.org/10.1145/2566486.2567982
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yanlei</given-names>
            <surname>Diao</surname>
          </string-name>
          , Pawel Guzewicz, Ioana Manolescu, and Mirjana Mazuran. [n. d.].
          <article-title>Spade: A Modular Framework for Analytical Exploration of RDF Graphs</article-title>
          .
          <source>In VLDB 2019 - 45th International Conference on Very Large Data Bases</source>
          . https://hal.inria.fr/hal-02152844
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yanlei</given-names>
            <surname>Diao</surname>
          </string-name>
          , Ioana Manolescu, and
          <string-name>
            <given-names>Shu</given-names>
            <surname>Shang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Dagger: Digging for Interesting Aggregates in RDF Graphs</article-title>
          . In
          <source>International Semantic Web Conference (ISWC)</source>
          . Vienna, Austria. https://hal.inria.fr/hal-01577464
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>François</given-names>
            <surname>Goasdoué</surname>
          </string-name>
          , Pawel Guzewicz, and
          <string-name>
            <given-names>Ioana</given-names>
            <surname>Manolescu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Incremental structural summarization of RDF graphs</article-title>
          .
          <source>In EDBT 2019 - 22nd International Conference on Extending Database Technology</source>
          . Lisbon, Portugal. https://hal.inria.fr/hal-01978784
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>François</given-names>
            <surname>Goasdoué</surname>
          </string-name>
          , Paweł Guzewicz, and
          <string-name>
            <given-names>Ioana</given-names>
            <surname>Manolescu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>RDF graph summarization for first-sight structure discovery</article-title>
          .
          <source>The VLDB Journal</source>
          (
          <year>2020</year>
          ). To appear.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Roy</given-names>
            <surname>Goldman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jennifer</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases</article-title>
          .
          <source>In Proceedings of 23rd International Conference on Very Large Data Bases</source>
          ,
          <year>1997</year>
          , Athens, Greece.
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Joseph M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , Peter J. Haas, and
          <string-name>
            <given-names>Helen J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Online Aggregation</article-title>
          .
          <source>In SIGMOD</source>
          .
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          , Davide Mottin, Themis Palpanas, and
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Data Exploration Using Example-Based Methods</article-title>
          . Morgan &amp; Claypool Publishers. https://doi.org/10.2200/S00881ED1V01Y201810DTM053
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Ioana</given-names>
            <surname>Manolescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mirjana</given-names>
            <surname>Mazuran</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Speeding up RDF aggregate discovery through sampling</article-title>
          .
          <source>In BigVis 2019 - 2nd International Workshop on Big Data Visual Exploration and Analytics</source>
          . Lisbon, Portugal. https://hal.inria.fr/hal-02065993
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Tova</given-names>
            <surname>Milo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Index structures for path expressions</article-title>
          .
          <source>In International Conference on Database Theory</source>
          . Springer,
          <fpage>277</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Amit</given-names>
            <surname>Somech</surname>
          </string-name>
          , Tova Milo, and
          <string-name>
            <given-names>Chai</given-names>
            <surname>Ozeri</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Predicting "What is Interesting" by Mining Interactive-Data-Analysis Session Logs</article-title>
          .
          <source>In Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019</source>
          , Lisbon, Portugal, March 2019.
          <fpage>456</fpage>
          -
          <lpage>467</lpage>
          . https://doi.org/10.5441/002/edbt.2019.42
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Bo</given-names>
            <surname>Tang</surname>
          </string-name>
          , Shi Han, Man Lung Yiu, Rui Ding, and
          <string-name>
            <given-names>Dongmei</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Extracting Top-K Insights from Multi-dimensional Data</article-title>
          .
          <source>In SIGMOD</source>
          .
          <fpage>1509</fpage>
          -
          <lpage>1524</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Manasi</given-names>
            <surname>Vartak</surname>
          </string-name>
          , Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran, and
          <string-name>
            <given-names>Neoklis</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics</article-title>
          .
          <source>PVLDB</source>
          <volume>8</volume>
          ,
          <issue>13</issue>
          (
          <year>2015</year>
          ),
          <fpage>2182</fpage>
          -
          <lpage>2193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Yuhao</given-names>
            <surname>Wen</surname>
          </string-name>
          , Xiaodan Zhu, Sudeepa Roy, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>QAGView: Interactively Summarizing High-Valued Aggregate Query Answers</article-title>
          .
          <source>In SIGMOD</source>
          .
          <fpage>1709</fpage>
          -
          <lpage>1712</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yihong</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prasad</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeffrey F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>An Array-Based Algorithm for Simultaneous Multidimensional Aggregates</article-title>
          .
          <source>In SIGMOD</source>
          .
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>