Morph-Skyline: Skyline Queries for Virtual
             Knowledge Graph Access

         Marlene Goncalves1,2 , David Chaves-Fraga2 , and Óscar Corcho2
                        1
                        Universidad Simón Bolivar, Venezuela
                                mgoncalves@usb.ve
          2
            Ontology Engineering Group, Universidad Politécnica de Madrid
                    {mgoncalves,dchaves,ocorcho}@fi.upm.es


        Abstract. A skyline set corresponds to the points that are non-dominated
        by any other points in terms of a multi-criteria function, i.e, there is no
        other point with values better than them in the criteria defined in a
        multi-criteria function. Particularly, skyline queries can be used to filter
        the points that best meet a multi-criteria function in decision making
        applications over large datasets. Most of these datasets are represented
        in relational databases under different schemes and data models. Inte-
        grating these datasets in a unified view may be useful for stakeholders to
        make more accurate decisions. As a proof of concept, we have developed
        Morph-Skyline, a virtual knowledge graph open source engine, which
        is able to automatically translate SPARQL skyline queries to equivalent
        ones in SQL. In this paper, we will demonstrate Morph-Skyline function-
        alities, and users will observe transport sector scenarios where Morph-
        Skyline allows specifying preferences by using skyline clause. We also
        evaluate the performance of Morph-Skyline against the state-of-the-art
        skyline methods for skyline processing over knowledge graphs. Evalua-
        tions were performed on Geonames and GTFS datasets of the subway
        network of Madrid with scale from 1 to 1000.

        Keywords: Skyline Queries · OBDA · Query translation · R2RML.


1     Introduction

Skyline queries [1] have gained popularity in decision making applications to
find a set of non-dominated data points (known as the skyline set) in a multi-
dimensional dataset, i.e., a set of points such that none of them is better than the
rest. Formally, a relational database is comprised of tables whose instances can
be seen formally as a set of multi-dimensional points that describe each table in
terms of their attributes. A skyline query consists of a list of numeric attributes
or dimensions, together with the MIN or MAX directives that indicate whether
the values of the corresponding dimension must be minimized or maximized. The
    Copyright c 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
 SELECT DISTINCT ? route ? stop                                           SELECT DISTINCT route_long_name ,
   ? demand ? use                       Line Station Demand     Use                stop_name , demand , use
 WHERE {? trip gtfs : route ? route .
                                                                          FROM stops s1 JOIN stop_times s2
   ? stopTime gtfs : trip ? trip .      5    253   59,429,010 1,267,151   ON s2 . stop_id = s1 . stop_id
   ? stopTime gtfs : stop ? stop .      10   278   65,134,740 1,481,940   JOIN trips t ON t . trip_id = s2 . trip_id
   ? stop ex : use ? use .              7    296   38,497,353   351,971   JOIN routes r ON r . route_id = t . route_id
   ? route ex : demand ? demand .}      1    295   80,042,527 1,536,049   JOIN demands d ON s . route_id = d . route_id
 SKYLINE OF ? demand MAX ,? use MIN     6    259   92,049,658 3,142,010   JOIN uses u ON u . stop_id = s1 . stop_id
                                        6    107   92,049,658 3,906,267   SKYLINE OF demand MAX , use MIN

 (a) SPARQL               skyline                  (b) Skyline                 (c) SQL skyline Query
 Query

Fig. 1: A sample skyline based on the Madrid Metro network. Metro
stations whose services may need to be reorganized in terms of demands per line
and station uses. The station 107 is dominated by the station 259.


answer of a skyline query q corresponds to the points in the multi-dimensional
dataset D that are non-dominated, i.e., all the points p, such that: i) there is no
other point p0 in D with values better or equal than p in all the dimensions of p,
and ii) other points in the skyline are better than p in at least one dimension.
    The problem of efficiently computing the skyline over a database was intro-
duced by Börzsönyi and colleagues [1] and it has been extensively studied in the
literature [4]. This problem has also been studied for Semantic Web. There are
some works related to extensions of SPARQL with qualitative preferences [7],
which are based on a more general operator than the skyline. Finally, [5] pre-
sented a set of client-based algorithms to evaluate skyline queries over knowledge
graphs using standard query interfaces for RDF although they assumed no con-
trol over how the data is stored (e.g., indexes, internal structures, etc). To the
best of our knowledge, existing OBDA engines do not support skyline queries.
OBDA systems are usually focused on the transformation of the original sources
into a global schema and its corresponding materialization but also on query
translation techniques for highly dynamic data sources [8].
    To illustrate the skyline approach, suppose a database containing data from
the Madrid metro system3 following the GTFS (General Transit Feed Specifica-
tion) model4 , a table named demands storing statistics on annual accumulated
demand or the number of users by line, and a table called uses containing the
number of uses or movements within each station where uses are calculated as
the number of entries and exits through the turnstile plus the number of transfers
between lines if it is a station with access to more than one Metro line.
    Consider a skyline query to identify the least used stations within the most
demanded lines. In this scenario, a station can be chosen, if and only if, there is no
other station with a lower use belonging to a line with a higher demand. To select
a station, we must identify the set of all the stations that are non-dominated by
any other station in terms of minimizing station uses and maximizing demands
per line. Following these criteria, a SPARQL skyline is expressed in Fig. 1a and
translated into SQL according to Fig. 1c. The computed skyline is presented
3
    https://www.metromadrid.es
4
    https://developers.google.com/transit/gtfs
in Fig. 1b. The stations 253, 278, 296, 295, and 259 are the non-dominated
ones, i.e., there is no other station s with values better than them in these two
attributes. Additionally, a station s1 dominates a station s2 , if s1 has better
or equal values and at least one better in demand and use than s2 , e.g., the
station 259 dominates the station 107. Based on related work, we devised a
solution where we have developed techniques able to identify the skyline set on
relational databases using an Ontology-Based Data Access (OBDA) approach
[2] and extending the Chebotko et al.’s translation method [2]. Particularly, we
have implemented an extension Chebotko et al.’s method which translates the
preference clause where each variable in the clause is mapped to an attribute
in the database. Our tool, Morph-Skyline5 , receives a skyline SPARQL query,
translates it into SQL according to the defined R2RML mappings, and then
executes the translated SQL query directly into the relational database engine.
Under this approach, our example skyline SPARQL query (Fig. 1a) is translated
to SQL (Fig. 1c) and then executed by a relational database engine.


2     The Morph-Skyline Architecture

At data level, the Morph-Skyline architecture, shown in Fig. 2a, is constituted
by a virtual knowledge graph, a relational schema and the mapping between
them. In this architecture, the user formulates skyline SPARQL queries over the
virtual knowledge graph, which are transformed into the underlying query lan-
guages of the relational database. Fig. 2b depicts the Morph-Skyline architecture
at component-level. The Morph-Skyline receives a skyline SPARQL query and a
set of R2RML mappings. Given a skyline SPARQL query, the parser component
performs its lexical, syntactic and semantic analysis. If the skyline SPARQL
query statement is correct, the Query Rewriter component prepares the query
to facilitate any optimization by the Query Translator. Then, the Query Trans-
lator component uses the mappings to translate the skyline query posed over the
ontology into a skyline SQL query to be executed by the underlying relational
database engine. Subsequently, the query evaluator component receives the sky-
line query rewritten by the Query Translator and executes it over the relational
database engine. Lastly, results obtained by the query evaluator are translated
by the query result writer component to RDF using the rules provided in the
mappings which define how ontological terms are related to terms occurring in
the relational schema.


3     Demonstration of Use Cases

Within the transportation domain, the Madrid Metro web page provides its data
in GTFS format together with statistics such as uses by stations and demands
5
    https://doi.org/10.5281/zenodo.3974178
    (a) Data-level architecture       (b) Component-level architecture

Fig. 2: The Morph-Skyline Architecture. Morph-Skyline receives a Skyline
SPARQL query and outputs the results of executing the SPARQL query over
Morph-RDB. Morph-Skyline translate from SPARQL to SQL to identify the
skyline set which is transformed to RDF as output of a query.


by lines6 in XLS format. We have joined these datasets in conjunction with pre-
calculated distances from Madrid Geonames dataset7 . For each station, we have
pre-computed the closest distance to a school, a market, a shopping center and
a hospital. This dataset was scaled-up over the scale values (5, 10, 50, 100, 500
and 1000) by using VIG8 . For these datasets, we have created an R2RML map-
ping document for accessing each relational table with 81 PredicateObjectMaps
and 14 JoinConditions. In the context of this demo we will show how Morph-
Skyline can be used in the transportation domain where stakeholders perform
SPARQL skyline queries over relational databases, so that the usefulness of sky-
line SPARQL queries can be better understood and demonstrated specifying
preferences as criteria in a SPARQL query. We will also show the impact of the
dataset and skyline sizes on the performance of Morph-Skyline and the methods
based on TPF, brTPF and SkyTPF [5] to evaluate skyline queries over knowl-
edge graphs. To evaluate the methods implemented in [5], we have transformed
the datasets into RDF by means of RDFizer [3] and then, we have converted
them to HDT by using the HDT library 9 . We report the average execution time
in seconds for 5 SPARQL skyline queries varying the number of skyline criteria
from 2 to 6. Each query was executed 5 times in cold mode on a machine with
the following characteristics: 2GHz CPU with 8 cores, 64 GB RAM with Ubuntu
18.04 as its operating system. It is important to highlight that we have had to
eliminate the DISTINCT feature in the SELECT clause to execute the methods
based on TPF, brTPF and SkyTPF since they do not parse SELECT DISTINCT
6
  https://www.metromadrid.es/es/transparencia/proyectos-y-datos\#panel1
7
  https://download.geonames.org/export/dump/
8
  https://github.com/ontop/vig
9
  http://www.rdfhdt.org/manual-of-the-java-hdt-library/
statement. Our results show that Morph-Skyline time is constant as the size of
the dataset increases because the size of the skyline remains small for the differ-
ent sizes of datasets and queries. It is noteworthy that the best case for a skyline
method is when the skyline is small [1]. Unfortunately, the methods based on
TPF and brTPF return empty answers and SkyTPF it outputs an execution
error when they were executed on the original dataset. For the scaled datasets,
the methods based on TPF, brTPF and SkyTPF were not able to execute the
queries because of an execution error. Due to the lack of space, the resources
and results obtained in this real use case are publicly available in the GitHub
repository of the engine10 .

4      Conclusions
Identifying the points in a database that best meet a set of user criteria can
be represented as a skyline-based ranking problem. In our case, we have gone
further by moving into an OBDA context. We have developed a tool that pro-
vides an efficient solution to this problem, allowing data to reside in a relational
database and skyline SPARQL queries to be used. As a proof of concept, we have
developed Morph-Skyline and implemented an algorithm based the Chebotko et
al.’s translation method on top of a relational database. We will demonstrate the
capabilities of our tool as well as the performance of our solution with respect
to state-of-the-art solutions.

References
1. Borzsonyi, S., Kossmann, D., Stocker, K.: The Skyline Operator. In: Proceedings of
   the 17th International Conference on Data Engineering, pp. 421-430. IEEE Com-
   puter Society. Heidelberg, Germany (2001).
2. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL transla-
   tion. Data Knowl. Eng., vol. 68(10), pp. 973-1000. Elsevier Science Publishers B. V.
   (2009).
3. Iglesias, E., Jozashoori, S., Chaves-Fraga, D., Collarana, D., Vidal, M.E., SDM-
   RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs,
   ACM Intern. Confer. on Information and Knowledge Management, (2020).
4. Kalyvas, C., Tzouramanis, T. A Survey of Skyline Query Processing. Computing
   Research Repository Journal (2017).
5. Keles, I., Hose, K.: Skyline Queries over Knowledge Graphs. In: The Semantic Web,
   pp. 293-310. Auckland, New Zealand (2019).
6. Mandl, S., Kozachuk, O., Endres, M., Kießling, W.: Preference Analytics in EX-
   ASolution. In: Datenbanksysteme für Business, Technologie und Web (BTW), pp.
   613-632. GI. Hamburg, Germany (2015).
7. Patel-Schneider, P., Polleres, A., Martin, D.: Comparative Preferences in SPARQL.
   Knowl. Eng. and Knowl. Management, pp. 289-305. Nancy, France (2018)
8. Sequeda, J., Arenas, M., Miranker, D.: OBDA: Query Rewriting or Materialization?
   In Practice, Both!. In: The Semantic Web - ISWC, pp. 535-551. Trentino, Italy
   (2014).
10
     https://github.com/oeg-upm/morph-skyline