<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SumMER: Summarizing RDF/S KBs using Machine LEaRning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgia Eirini Trouli</string-name>
          <email>troulin@ics.forth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgia Troullinou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lefteris Koumakis</string-name>
          <email>koumakis@ics.forth.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaos Papadakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haridimos Kondylakis</string-name>
          <email>kondylak@ics.forth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, Hellenic Mediterranean University</institution>
          ,
          <addr-line>Heraklion, Crete</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science</institution>
          ,
          <addr-line>FORTH, Heraklion, Crete</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge graphs have now become common on the web, ranging from small taxonomies for categorizing web sites, to large knowledge bases that contain a vast amount of structured content. To enable their quick understanding and exploration semantic summaries have been proposed. A key issue of structural semantic summaries is the identification of the most important nodes. Works in the area, usually employ a single centrality measure, capturing a specific perspective on the notion of a node's importance. However, combining multiple centrality measures could give a more objective view, on which nodes should be selected as the most important ones. In this paper, we present SumMER, a novel framework that explores machine learning techniques for optimally combining multiple centrality measures for selecting the most important nodes. The experiments performed show the benefit of our approach, effectively increasing the quality of the generated summaries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction &amp; Solution</title>
      <p>
        Structural non-quotient summarization methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
distinguish between the schema and the instance graph of an RDF/S KB, as the schema graph
offers a first natural way to provide an overview of the KB contents. Even when the
schema graph is not available, schema discovery tools can be used to discover it [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. To proceed with the summarization task, state-of-the-art works then select the most
important nodes of an RDF/S schema graph, based on an importance measure, and
link those nodes using various algorithms in order to generate a connected subgraph
of the original one.
      </p>
      <p>
        The problem. For generic graphs, multiple centrality measures have been proposed,
each one perceiving importance using different criteria. However, no single centrality
measure dominates the others, and each one is appropriate for different notions of
importance over different types of graphs. On the other hand, we have already shown that
several of these centralities are correlated [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], whereas other approaches
exploit graph neural networks for estimating node importance [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Existing
approaches to structural summarization in most cases select a single centrality measure
(or just a few) that produces the best results when selecting the most important
nodes for a specific ontology. However, despite the fact that centrality measures offer
complementary views on a node’s importance, to the best of our knowledge there
is so far no mechanism able to exploit them all.
      </p>
      <p>Our solution. We argue that combining multiple such measures could give a more
objective view of which nodes should be selected as the most important ones. To this
end, in this paper we present SumMER, which exploits machine learning
algorithms for optimally combining multiple importance measures for node selection.
To the best of our knowledge, no other approach so far combines structural
summarization techniques with machine learning for RDF/S KBs. More specifically, to
generate a summary using SumMER, we follow the three steps shown in Fig. 1. The first
two steps identify the top-k most important schema nodes, whereas the last
one focuses on linking the selected schema nodes, possibly introducing additional
nodes to the schema summary.</p>
      <p>
        Selecting top-k nodes. The first step in identifying the top-k nodes in the schema graph GS is to
calculate the importance of each node in the graph. As already mentioned, multiple graph
centrality measures have been proposed in the literature, each one capturing a different
perspective on a node’s importance. In this work we do not try to identify an optimal
centrality measure, as we believe that they offer different perspectives on a node’s
importance and that, ideally, they should all be considered when assessing it.
To this end, we calculate a diverse set of centrality measures for each node
(i.e. degree, bridging, harmonic, radiality, ego, betweenness,
PageRank and HITS), shown in Fig. 1. Note that many of those measures are correlated, as we
have already shown [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As such, within this step we select the non-correlated ones
(in bold) to be further exploited as features in the subsequent machine learning phase.
      </p>
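      <p>As a rough illustration of this step, the sketch below computes two of the listed measures (degree and harmonic centrality) for every node of a small undirected graph and collects them into per-node feature vectors. The toy graph, its node names and the helper function names are hypothetical and are not taken from SumMER; further measures would be appended to the vectors in the same way.</p>
      <preformat><![CDATA[
```python
from collections import deque


def degree_centrality(graph):
    """Number of neighbours of each node (undirected adjacency dict)."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def harmonic_centrality(graph):
    """Sum of 1/d(u, v) over all other reachable nodes v (BFS distances)."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in graph[current]:
                if neighbour not in dist:
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        scores[source] = sum(1.0 / d for node, d in dist.items() if node != source)
    return scores


def feature_vectors(graph):
    """One vector [degree, harmonic] per node, usable as regression features."""
    degree = degree_centrality(graph)
    harmonic = harmonic_centrality(graph)
    return {node: [degree[node], harmonic[node]] for node in graph}


# Hypothetical toy schema graph (symmetric adjacency sets).
TOY_SCHEMA = {
    "Person": {"Agent", "Paper", "Organization"},
    "Agent": {"Person"},
    "Paper": {"Person", "Event"},
    "Organization": {"Person"},
    "Event": {"Paper"},
}

features = feature_vectors(TOY_SCHEMA)
print(features["Person"])
```
]]></preformat>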
      <p>To explore the combination of multiple centrality measures for identifying the top-k
schema nodes, we model the task as a regression problem: the model scores all schema
nodes, and the k highest-ranked ones are selected. In this paper, we explore the following machine
learning algorithms: AdaBoost regressor, Gradient Boosting regressor, Extra Tree
regressor, Random Forest regressor, Linear regression, Decision Tree regressor, Bayesian
Ridge and ElasticNet. For each schema node we construct a vector with the
selected centrality measures as features, which the regressor uses to predict the
node's importance.</p>
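      <p>The regression formulation can be sketched as follows, using plain linear regression (one of the algorithms listed above) fitted through the normal equations. This is only a minimal illustration under stated assumptions: the two-feature vectors, the training targets and all function names are hypothetical, and the sketch is not the SumMER implementation.</p>
      <preformat><![CDATA[
```python
def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(map(float, x)) for x in X]  # prepend the bias term
    n = len(rows[0])
    # Augmented system (A^T A | A^T y).
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        pivot_value = system[col][col]
        system[col] = [v / pivot_value for v in system[col]]
        for r in range(n):
            if r != col:
                factor = system[r][col]
                system[r] = [a - factor * b for a, b in zip(system[r], system[col])]
    return [system[i][n] for i in range(n)]  # [intercept, w1, w2, ...]


def predict(weights, x):
    """Predicted importance score of one feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def top_k(weights, features, k):
    """Rank nodes by predicted score and keep the k best."""
    ranked = sorted(features, key=lambda node: predict(weights, features[node]), reverse=True)
    return ranked[:k]


# Hypothetical training data: two centrality features per node and a target
# importance score (for example, derived from query frequencies).
TRAIN_X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
TRAIN_Y = [0.5, 2.5, 1.5, 3.5, 5.5]

# Hypothetical unseen schema nodes with their feature vectors.
CANDIDATES = {
    "Person": [2.0, 3.0],
    "Paper": [1.0, 1.0],
    "Event": [0.0, 4.0],
    "Agent": [0.0, 0.5],
}

weights = fit_linear_regression(TRAIN_X, TRAIN_Y)
print(top_k(weights, CANDIDATES, 2))
```
]]></preformat>
      <p>Any of the other listed regressors could replace the linear model here, since only the predicted scores are used for ranking.</p>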
      <p>
        Linking schema nodes. Independently of the way the most important nodes are
selected, the next step is to link those nodes, forming a connected schema subgraph.
Similarly to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we treat this problem as a variation of the well-known Graph
Steiner-Tree problem, trying to minimize the number of additional nodes introduced for connecting
the top-k most important nodes.
      </p>
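      <p>To make the linking step concrete, the sketch below applies a simple greedy shortest-path heuristic for the Steiner-tree problem on an unweighted graph: starting from one selected node, it repeatedly attaches the closest not-yet-connected selected node through a breadth-first shortest path. This heuristic is an assumption chosen for illustration; the text above does not state which Steiner-tree algorithm SumMER uses, and the toy graph and function name are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import deque


def link_nodes(graph, terminals):
    """Greedy shortest-path heuristic for an unweighted Steiner tree.

    Returns the nodes and edges of a connected subgraph containing all
    terminals; the result is not guaranteed to be minimal.
    """
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:])
    while remaining:
        # Multi-source BFS starting from every node already in the tree.
        parent = {node: None for node in tree_nodes}
        queue = deque(sorted(tree_nodes))
        reached = None
        while queue and reached is None:
            current = queue.popleft()
            for neighbour in sorted(graph[current]):
                if neighbour not in parent:
                    parent[neighbour] = current
                    if neighbour in remaining:
                        reached = neighbour
                        break
                    queue.append(neighbour)
        if reached is None:
            raise ValueError("terminals are not connected in the graph")
        # Walk back to the tree, adding the nodes and edges of the path.
        node = reached
        while parent[node] is not None:
            tree_edges.add(frozenset((node, parent[node])))
            tree_nodes.add(node)
            node = parent[node]
        remaining -= tree_nodes
    return tree_nodes, tree_edges


# Hypothetical toy schema graph and selected top-k nodes.
GRAPH = {
    "Person": {"Agent", "Paper", "Organization"},
    "Agent": {"Person"},
    "Paper": {"Person", "Event"},
    "Organization": {"Person"},
    "Event": {"Paper"},
}
TOP_K = ["Agent", "Event", "Organization"]

nodes, edges = link_nodes(GRAPH, TOP_K)
additional = nodes - set(TOP_K)  # nodes introduced only to connect the summary
print(sorted(additional))
```
]]></preformat>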
    </sec>
    <sec id="sec-2">
      <title>Preliminary Evaluation</title>
      <p>Next, we present an overview of the datasets used and the methodology for
experimentally evaluating the constructed summaries.</p>
      <p>Datasets. For evaluating our approach, we use DBpedia v3.8, DBpedia v3.9, and the
Semantic Web Dog Food (SWDF). For these datasets we also have query logs available,
provided by LSQ (https://aksw.github.io/LSQ/), containing 50K user queries for v3.8,
110K user queries for v3.9 and 2.5K user queries for SWDF, which we exploit for
evaluation as described below. Table 1 summarizes the characteristics of the
three ontology versions we use for our evaluation.</p>
      <p>Table 1. Characteristics of the three ontology versions: DBpedia 3.8, DBpedia 3.9 and SWDF.</p>
      <p>
        Competitors. As ML has not previously been used for generating structural
summaries, we compare our approach with RDFDigest+, the latest approach for generating
structural summaries, which has been shown to outperform past approaches [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Constructing a “gold standard”. In order to construct a “gold standard” for
the most important nodes, and to evaluate the regression models, we exploit the query
logs for the three available ontology versions, calculating how frequently each schema
node is queried. We consider as the most important nodes the ones that appear most
frequently in the queries.</p>
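      <p>A minimal sketch of this frequency-based ranking is given below. It assumes that each query has already been reduced to the list of schema nodes it mentions (parsing the SPARQL logs is out of scope here) and that a node is counted at most once per query; both choices, as well as the function names and the sample queries, are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def node_frequencies(parsed_queries):
    """Count in how many queries each schema node appears."""
    counts = Counter()
    for terms in parsed_queries:
        counts.update(set(terms))  # assumption: count a node once per query
    return counts


def gold_standard(parsed_queries, k):
    """Return the k most frequently queried nodes (ties broken by name)."""
    counts = node_frequencies(parsed_queries)
    ranked = sorted(counts, key=lambda node: (-counts[node], node))
    return ranked[:k]


# Hypothetical, already parsed query log.
QUERIES = [
    ["Person", "Paper"],
    ["Person", "Event"],
    ["Person", "Paper", "Paper"],
    ["Organization"],
]

print(gold_standard(QUERIES, 2))
```
]]></preformat>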
      <p>Metrics. For evaluating the performance of our machine learning algorithms, we
use the Mean Absolute Error (MAE), which is commonly used for evaluating regression
problems. However, as we are only looking for the top-k nodes, we evaluate this
metric on the aforementioned k nodes only. In addition, we calculate the coverage of each
summary, i.e. for each query we calculate the percentage of its classes and properties
that are included in the summary. The query coverage is then the weighted sum of
these two percentages.
As our summaries are node-based, we give a weight of 0.8 to the percentage of the classes
and a weight of 0.2 to the percentage of the properties.</p>
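      <p>Both metrics can be sketched in a few lines. The weights 0.8 and 0.2 come from the text above; everything else is an assumption made for illustration: the data layout, the function names, treating an empty class or property set as contributing 0, averaging the query coverage over all queries to obtain a summary-level value, and taking the k nodes for the MAE from the true scores.</p>
      <preformat><![CDATA[
```python
def fraction_covered(used, kept):
    """Share of the query's terms that also appear in the summary."""
    used = set(used)
    if not used:
        return 0.0  # assumption: an empty set contributes nothing
    return len(used & set(kept)) / len(used)


def query_coverage(query, summary, class_weight=0.8, property_weight=0.2):
    """Weighted sum of the covered class and property fractions of one query."""
    classes = fraction_covered(query["classes"], summary["classes"])
    properties = fraction_covered(query["properties"], summary["properties"])
    return class_weight * classes + property_weight * properties


def summary_coverage(queries, summary):
    """Assumption: the summary-level coverage is the mean over all queries."""
    return sum(query_coverage(q, summary) for q in queries) / len(queries)


def top_k_mae(true_scores, predicted_scores, k):
    """Mean absolute error restricted to the k nodes with the highest true score."""
    top = sorted(true_scores, key=lambda n: (-true_scores[n], n))[:k]
    return sum(abs(true_scores[n] - predicted_scores[n]) for n in top) / len(top)


# Hypothetical summary, queries and scores.
SUMMARY = {"classes": {"Person", "Paper"}, "properties": {"author"}}
QUERIES = [
    {"classes": ["Person", "Event"], "properties": ["author"]},
    {"classes": ["Paper"], "properties": ["title", "author"]},
]
TRUE = {"Person": 10, "Paper": 6, "Event": 1}
PRED = {"Person": 8, "Paper": 7, "Event": 5}

print(summary_coverage(QUERIES, SUMMARY))
print(top_k_mae(TRUE, PRED, 2))
```
]]></preformat>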
      <p>Experiments. For the evaluation of the node selection of the various algorithms, we
attempt to predict the top 10%, 15%, 20%, 25% and 30% of the nodes to be included in the
summary. For all the experiments, we use DBpedia v3.9 as the training dataset and
DBpedia v3.8 and SWDF as the test datasets. We first perform a feature selection step,
in which we select the non-correlated centrality measures (betweenness, radiality,
PageRank, HITS and instances); we then train the selected models on DBpedia v3.9 and
compare their performance on the training set against the test sets. We use 10-fold cross validation on the training dataset.</p>
      <p>The results are shown in Fig. 2. The Decision Tree regressor
performs best in almost all cases, whereas most of the algorithms show relatively good
performance. Looking at the confusion matrices (not presented here due to lack of
space), we see that the Decision Tree regressor predicts the true
positives best for all summary sizes, outperforming, among others, RDFDigest+ in all
cases. The good performance in selecting the top-k nodes is also reflected in the
subsequent calculation of the coverage for both DBpedia 3.8 and SWDF. As shown in
Fig. 2 (right), SumMER always achieves higher coverage than RDFDigest+.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>
        Overall, the results show that our approach is able to generate better summaries, which
are able to answer more query fragments than previous works. This holds not only for
different versions of the same ontology (DBpedia), but also for a completely different
ontology (SWDF), showing that our approach is able to generalize to semantic
graphs with different structure. The Decision Tree regressor has been identified
as the best-performing algorithm, with stable behaviour over the different KBs used. For future
work, we intend to exploit learning-to-rank methods [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and also to
explore methods for personalizing summaries based on user input. Another interesting idea
would be to explore deep learning methods for generating summaries.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This research project was supported by the Hellenic Foundation for Research and
Innovation (H.F.R.I.) under the “2nd Call for H.F.R.I. Research Projects to support
Post-Doctoral Researchers” (iQARuS Project No 1147).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lissandrini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mottin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>SOFOS: Demonstrating the Challenges of Materialized View Selection on Knowledge Graphs</article-title>
          .
          <source>SIGMOD</source>
          (
          <year>2021</year>
          ),
          <fpage>2789</fpage>
          -
          <lpage>2793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agathangelos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>RDF Query Answering Using Apache Spark: Review and Assessment</article-title>
          .
          <source>ICDE Workshops</source>
          (
          <year>2018</year>
          )
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cebiric</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goasdoué</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , et al.:
          <article-title>Summarizing semantic graphs: a survey</article-title>
          .
          <source>The VLDB Journal</source>
          (
          <year>2019</year>
          ),
          <volume>28</volume>
          (
          <issue>3</issue>
          ),
          <fpage>295</fpage>
          -
          <lpage>327</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pouriyeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allahyari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , et al.:
          <article-title>Ontology Summarization: Graph-Based Methods and Beyond</article-title>
          .
          <source>Int. J. Semantic Comput.</source>
          (
          <year>2019</year>
          ),
          <volume>13</volume>
          (
          <issue>2</issue>
          ),
          <fpage>259</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploring RDFS KBs Using Summaries</article-title>
          .
          <source>International Semantic Web Conference</source>
          (
          <year>2018</year>
          ),
          <fpage>268</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>RDFDigest+: A Summary-driven System for KBs Exploration</article-title>
          .
          <source>International Semantic Web Conference (P&amp;D/Industry/BlueSky)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pappas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roussakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploring Importance Measures for Summarizing RDF/S KBs</article-title>
          .
          <source>ESWC</source>
          ,
          <year>2017</year>
          ,
          <fpage>387</fpage>
          -
          <lpage>403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vassiliou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pitoura</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Coverage-Based Summaries for RDF KBs</article-title>
          .
          <source>ESWC (Satellite Events)</source>
          <year>2021</year>
          :
          <fpage>98</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lissandrini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mottin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>SOFOS: Demonstrating the Challenges of Materialized View Selection on Knowledge Graphs</article-title>
          .
          <source>SIGMOD Conference</source>
          <year>2021</year>
          :
          <fpage>2789</fpage>
          -
          <lpage>2793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vassiliou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
          </string-name>
          , H.:
          <article-title>WBSum: Workload-based Summaries for RDF/S KBs</article-title>
          . SSDBM (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kellou-Menouer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kardoulakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          et al.:
          <article-title>A survey on Semantic Schema Discovery</article-title>
          ,
          <source>The VLDB Journal</source>
          (
          <year>2021</year>
          )
          (in press).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kardoulakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kellou-Menouer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troullinou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources</article-title>
          .
          <source>SSDBM</source>
          (
          <year>2021</year>
          ),
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Estimating Node Importance in Knowledge Graphs Using Graph Neural Networks</article-title>
          .
          <source>KDD</source>
          ,
          <year>2019</year>
          ,
          <fpage>596</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Learning to Rank. Available online: https://en.wikipedia.org/wiki/Learning_to_rank (visited May
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>