SumMER: Summarizing RDF/S KBs
                      using Machine LEaRning1

              Georgia Eirini Trouli1, Georgia Troullinou2, Lefteris Koumakis2,
                    Nikolaos Papadakis1 and Haridimos Kondylakis1,2
     1
      Department of Electrical and Computer Engineering, Hellenic Mediterranean University,
                                     Heraklion, Crete, Greece
                ddk7@edu.hmu.gr, npapadak@cs.hmu.gr, kondylak@ics.forth.gr
               2
                 Institute of Computer Science, FORTH, Heraklion, Crete, Greece
                            troulin@ics.forth.gr, koumakis@ics.forth.gr


          Abstract. Knowledge graphs have now become common on the web, ranging
          from small taxonomies for categorizing web sites, to large knowledge bases that
          contain a vast amount of structured content. To enable their quick understanding
          and exploration semantic summaries have been proposed. A key issue of struc-
          tural semantic summaries is the identification of the most important nodes.
          Works in the area, usually employ a single centrality measure, capturing a spe-
          cific perspective on the notion of a node’s importance. However, combining mul-
          tiple centrality measures could give a more objective view, on which nodes
          should be selected as the most important ones. In this paper, we present Sum-
          MER, a novel framework that explores machine learning techniques for opti-
          mally combining multiple centrality measures for selecting the most important
          nodes. The experiments performed show the benefit of our approach, effectively
          increasing the quality of the generated summaries.


1         Introduction & Solution

The explosion of the Data Web and the associated Linked Open Data (LOD) initiative
have led to the generation of an enormous amount of RDF datasets that are currently
widely available [1], [2]. These datasets often have extremely complex schemas, which
are difficult to comprehend, limiting the exploitation potential of the information they
contain. Semantic summarization has been recognized as an important tool to facilitate
ontology understanding, further supporting ontology exploration and reuse. A semantic
summary, according to our recent survey [3] is “a compact information extracted from
the original graph, offering a way to extract meaning from data while reducing its size,
and/or a graph, which some application can exploit instead of the original graph” to
perform certain tasks more efficiently like query answering, source selection etc.
   In this paper, we focus on structural summarization methods, which consider first
and foremost the graph structure, in order to generate summaries. Recent, state of the

1
    Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
     Attribution 4.0 International (CC BY 4.0).
2

art, structural non-quotient summarization methods [3],[4],[5],[6],[7],[8],[9], [10] sep-
arate between the schema and the instance graph of an RDF/S KB, as the schema graph
offers a first natural way to provide an overview of the KB contents – even when the
schema graph is not available, schema discovery tools can be used to discover it [11],
[12]. Then to proceed with the summarization task, state of the art works select the most
important nodes of an RDF/S schema graph, based on an importance measure, and then
link those nodes using various algorithms in order to generate a connected subgraph
out of the original one.
   The problem. For generic graphs, multiple centrality measures have been proposed,
each one perceiving importance using different criteria. However, there is no centrality
measure to dominate them all, and each one is appropriate for different notions of im-
portance over different types of graphs. On the other hand, we have already shown that
several of these centralities interrelate [7], whereas there have been approaches that
exploit graph neural networks for estimating node importance [13]. Existing ap-
proaches on structural summarization, in most of the cases select a single (or just a few)
centrality measure(s) that produce the best results for selecting the most important
nodes for a specific ontology. However, despite the fact that centrality measures offer
a complementary view on node’s importance, to the best of our knowledge, so far there
is no mechanism able to exploit them all.
   Our solution. We argue that combining multiple such measures could give us an
objective view on which nodes should be selected as the most important ones. To this
direction, in this paper, we present SumMER, effectively exploiting machine-learning
algorithms for optimally combining multiple importance measures for node selection.
To the best of our knowledge, no other approach so far combines structural summari-
zation techniques with machine learning for RDF/S KBs. More specifically, for gen-
erating a summary using SuMMeR, we follow the three steps, shown in Fig. 1. The first
two steps are trying to identify the top-k most important schema nodes, whereas the last
one focuses on linking the selected schema nodes, possibly introducing additional
nodes to the schema summary.
   Selecting top-k nodes. The first step in identifying the top-k nodes in GS is to calcu-
late for each node its importance in the graph. As already mentioned, multiple graph
centrality measures have been proposed in the literature, each one capturing a different
perspective on the node’s importance. In this work we do not try to identify an optimal


                                Fig. 1. SuMMeR’s workflow.
                                                                                      3

centrality measure, as we believe that they offer different perspectives on a node’s im-
portance and that ideally, they should be all considered for assessing a node’s im-
portance. To this direction we exploit a diverse set of centrality measures that we cal-
culate for each node (i.e. degree, bridging, harmonic, radiality, ego, betweenness, Pag-
eRank and HITS), shown in Fig.1. Note that many of those measures correlate as we
have already shown [7] and as such within this step we select the ones not correlated
(in bold) to be further exploited as features for the subsequent machine learning phase.
   To explore the combination of multiple centrality measures for identifying the top-k
schema nodes, we model the problem as a regression problem, trying to rank all schema
nodes for selecting the top-k ones. In this paper, we explore the following machine
learning algorithms: Adaboost regressor, Gradient Boosting regressor, Extra Tree re-
gressor, Random Forest regressor, Linear regression, Decision Tree regressor, Bayesian
Ridge and ElasticNet. As such for each schema node we construct a vector with the
selected centrality measures as features, trying to identify the top-k most important
nodes.
   Linking schema nodes. Independent of the way the most important nodes are se-
lected, the next step is to link those nodes formulating a connected schema subgraph.
Similarly, to [7], we perceive this problem as a variation of the well-known Graph Stei-
ner-Tree problem, trying to minimize the additional nodes introduced for connecting
the top-k most important nodes.


2      Preliminary Evaluation

Next, we present an overview of the datasets used and the methodology for our exper-
imentally evaluating the constructed summaries.
   Datasets. For evaluating our approach, we use DBpedia v3.8, DBpedia v3.9, and the
Semantic Web Dog Food (SWDF). For those versions we also have available query
logs containing 50K user queries for v3.8, 110K user queries for v3.9 and 2.5K user
queries, for SWDF provided by LSQ (https://aksw.github.io/LSQ/) that we exploit for
evaluation as we shall see in the sequel. Table 1 summarizes the characteristics of the
three ontology versions we use for our evaluation.

                            Table 1. Ontology Characteristics.
                       Classes          Properties       User Queries      Storage
 DBPedia 3.8            315               1323               50K           103 GB
 DBPedia 3.9            497               1805              110 K          114 GB
 SWDF                   120                72               2.5 K          50 MB

   Competitors. As ML has not been previously used for generating structural sum-
maries we compare our approach with RDFDigest+, the latest approach for generating
structural summaries that has been shown to outperform past approaches [5], [7].
   Constructing a “golden standard”. In order to construct a “golden standard” for
the most important nodes, and to evaluate the regression models, we exploit the query
logs for the three available ontology versions, calculating the schema nodes that are
4

more frequently queried. We assess as the most important, the ones that have a higher
frequency of appearance in the queries.
    Metrics. For evaluating the performance of our machine learning algorithms, we
used Mean Absolute Error (MAE) as commonly used for evaluating regression prob-
lems. However, note that as we are only looking for the top-k nodes, we evaluate those
metrics on the aforementioned k nodes only. In addition, we calculate for each summary
its coverage, i.e. we calculate for each query the percentage of the classes and properties
that are included in the summary. Having the percentages of the classes and properties
included in the summary, the query coverage is the weighted sum of these percentages.
As our summaries are node based we give 0.8 weight to the percentage of the classes
and 0.2 weight to the percentage of the properties.
    Experiments. For the evaluation of the node selection of the various algorithms, we
attempt to predict the top 10%, 15%, 20%, 25%, 30% of nodes to be included in the
summary. For all the experiments, we use the DBpedia v3.9 as the training dataset and
the DBpedia v3.8 and SWDF as the test datasets. We perform a feature selection step
where we select the non-correlated centrality measures (betweenness, radiality, page
rank, hits and instances), we train the selected models on DBpedia v3.9 and then we
evaluate the train versus test set. We use 10-fold cross validation for the training dataset.


Fig. 2. MAE and Coverage for testing on DBpedia 3.8 and SWDF.

   The results are shown in Fig. 2, and as presented, the Decision Tree regressor per-
forms best in almost all cases, whereas most of the algorithms show a relatively good
performance. Looking at the confusion matrices (not presented here due to lack of
space) we can identify that the Decision Tree regressor is able to predict best the true
positives in all summary sizes, outperforming among others the RDFDigest+ in all
cases. The good performance on selecting the top-k nodes is also depicted in the sub-
sequent calculation of the coverage for both DBpedia 3.8 and SWDF. As shown in Fig.
2 (right) SumMER is always better than RDFDigest+.


3      Conclusions

Overall, the results show that our approach is able to generate better summaries, that
are able to answer more query fragments than previous works. This is true, not only in
subsequent versions of the same ontology (DBpedia), but also in completely different
ontologies (SWDF), showing that our approach is able to generalize into semantic
graphs with different structure. Overall, Decision Trees Regression has been identified
as the best performing algorithm with stability over the different KBs used. For future
                                                                                                5

work, we intend to exploit machine-learning methods learning to rank [14], and also
explore methods for personalizing summaries based on user input. An interesting idea
would be also to explore deep learning methods for generating summaries.


Acknowledgements

This research project was supported by the Hellenic Foundation for Research and In-
novation (H.F.R.I.) under the “2nd Call for H.F.R.I. Research Projects to support Post-
Doctoral Researchers” (iQARuS Project No 1147).

References
 1. Troullinou, G., Kondylakis, H., Lissandrini, M., Mottin, D.: SOFOS: Demonstrating the
    Challenges of Materialized View Selection on Knowledge Graphs, SIGMOD (2021), 2789–
    2793.
 2. Agathangelos, G., Troullinou, G., Kondylakis, H., Stefanidis, K., Plexousakis, D: RDF
    Query Answering Using Apache Spark: Review and Assessment. ICDE Workshops (2018)
    54-59.
 3. Cebiric, S., Goasdoué, F., Kondylakis, H., et al.: Summarizing semantic graphs: a survey.
    The VLDB Journal (2019), 28(3), 295-327.
 4. Pouriyeh, S., Allahyari, M., Liu, Q., et al.: Ontology Summarization: Graph-Based Methods
    and Beyond. Int. J. Semantic Comput. (2019),13(2), 259-283.
 5. Troullinou, G., Kondylakis, H., Stefanidis, K., Plexousakis, D.: Exploring RDFS KBs Using
    Summaries. International Semantic Web Conference (2018), 268-284.
 6. Troullinou, G., Kondylakis, H., Stefanidis, K., Plexousakis, D.: RDFDigest+: A Summary-
    driven System for KBs Exploration. International Semantic Web Conference (P&D/Indus-
    try/BlueSky), 2018.
 7. Pappas, A., Troullinou, G., Roussakis, G., Kondylakis, H., Plexousakis, D.: Exploring Im-
    portance Measures for Summarizing RDF/S KBs. In ESWC, 2017, 387-403.
 8. Vassiliou, G., Troullinou, G., Papadakis, N., Stefanidis, K., Pitoura, E., Kondylakis, H: Cov-
    erage-Based Summaries for RDF KBs. ESWC (Satellite Events) 2021: 98-102.
 9. Troullinou, G., Kondylakis, H., Lissandrini, M., Mottin, D.: SOFOS: Demonstrating the
    Challenges of Materialized View Selection on Knowledge Graphs. SIGMOD Conference
    2021: 2789-2793.
10. Vassiliou, G., Troullinou, G., Papadakis, N., Kondylakis, H.: WBSum: Workload-based
    Summaries for RDF/S KBs. SSDBM (2021).
11. Kellou-Menouer, K., Kardoulakis, N., Troullinou, G. et al.: A survey on Semantic Schema
    Discovery, The VLDB Journal (2021) (in press).
12. Kardoulakis, N., Kellou-Menouer, K., Troullinou, G., et al.: HInT: Hybrid and Incremental
    Type Discovery for Large RDF Data Sources. SSDBM (2021), 97–108.
13. Park, N., Kan, A., Dong, X.L., Zhao, T., Faloutsos, C.: Estimating Node Importance in
    Knowledge Graphs Using Graph Neural Networks. KDD, 2019, 596-606.
14. Learning to Rank, Available online: https://en.wikipedia.org/wiki/Learning_to_rank (visited
    May 2020)