=Paper=
{{Paper
|id=Vol-2161/paper15
|storemode=property
|title=How Modern Deductive Database Systems Can Enhance Data Integration
|pdfUrl=https://ceur-ws.org/Vol-2161/paper15.pdf
|volume=Vol-2161
|authors=Francesco Calimeri,Simona Perri,Giorgio Terracina,Jessica Zangari
|dblpUrl=https://dblp.org/rec/conf/sebd/CalimeriPTZ18
}}
==How Modern Deductive Database Systems Can Enhance Data Integration==
Francesco Calimeri (1,2), Simona Perri (1), Giorgio Terracina (1), and Jessica Zangari (1)

(1) Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
{calimeri,perri,terracina,zangari}@mat.unical.it
(2) DLVSystem Srl, Rende, Italy
calimeri@dlvsystem.com

Abstract. Data integration systems provide transparent access to heterogeneous, possibly distributed, sources; deductive databases and their extensions make it easy to address complex issues arising in data integration. However, the gap between state-of-the-art deductive databases and data integration systems is still to be closed. In this paper we focus on some recent advancements implemented in the I-DLV system, and point out how these can facilitate the development of advanced data integration systems.

Keywords: Deductive Database, Data Integration, Instantiation, Answer Set Programming

SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the author(s).

1 Introduction

The task of an information integration system is to combine data residing at different sources, providing the user with a unified view called the global schema. Users can formulate queries over the global schema in a transparent and declarative way, without being aware of details about the sources: the information integration system automatically retrieves the relevant data from the sources and suitably combines it to provide answers to user queries [16]. The global schema may also contain integrity constraints (such as key dependencies, inclusion dependencies, etc.).

Recent developments in IT have made available a huge number of information sources, typically autonomous, heterogeneous, and widely distributed. As a consequence, information integration has emerged as a crucial issue in several application domains, e.g., distributed databases, cooperative information systems, data warehousing, ontology-based data access, or on-demand computing. Deductive database systems in general, and Answer Set Programming (ASP) in particular, are powerful tools in this context, as demonstrated, for instance, by the approaches formalized in [3, 4, 17]. More generally, the adoption of logic-based systems makes it easy to address complex problems like Consistent Query Answering (CQA) [22] and querying ontologies under inconsistency [15]. The database community has invested considerable effort in this area, and relevant research results have been obtained that clarify the semantics, decidability, and complexity of data integration systems under different assumptions. However, filling the gap between deductive database systems and database integration tools is still an open challenge, and continuous improvements and extensions of ASP systems [9, 14] are certainly important contributions toward this goal.

In this paper we discuss some of the most recent database-oriented innovations in ASP as implemented in the I-DLV system, and we point out how such improvements may enhance advanced data integration systems.

2 The I-DLV System

The I-DLV system [7] is a modern stand-alone ASP instantiator and deductive database engine, which has also been integrated as the grounding module of the renewed version of the popular DLV system [1]. A description of all the features of I-DLV is out of the scope of this paper; in the following, we outline the major features with an important impact on I-DLV as a deductive database engine.
For a comprehensive list of customizations and options, along with further details, we refer the reader to [7, 6] and to the online documentation [10].

2.1 Overview of evaluation features

I-DLV supports the ASP-Core-2 [5] standard language; its high flexibility and extensible design ease the incorporation of optimization techniques, language updates, and customizations. We next provide a brief overview of its instantiation process, focusing on peculiar optimizations whose synergic work, which can be driven at a fine-grained level by the user, is the key to I-DLV's efficiency.

Optimizations. The system adopts a bottom-up evaluation strategy based on a semi-naïve approach [27]. One of the most crucial and computationally expensive tasks is the grounding of each rule; it resembles the evaluation of the relational joins among the positive body literals, and I-DLV adopts a number of techniques to optimize it, many of which are inspired by the database field and properly enhanced and adapted to I-DLV's purposes. Among them are body-reordering criteria, indexing strategies, and decomposition rewritings [8], along with additional fine-tuning optimizations acting to different extents on the evaluation process, with the general common aim of reducing the search space and improving overall performance [7].

– Body-reordering techniques aim at finding an optimal execution ordering for the join operations by varying the order of literals in the rule bodies. Different orderings have been defined for I-DLV; the one adopted by default has been specifically designed by considering the effects of each literal on the binding of variables [7].
– Indexing techniques, instead, are intended to optimize the retrieval of matching instances from the predicate extensions. I-DLV defines a flexible indexing schema: any predicate argument can be indexed, allowing both single- and multiple-argument indices, and for each predicate different indexing data structures can be built "on demand", only if needed, while instantiating a rule containing that predicate in its body.
– I-DLV also exploits a heuristic-guided decomposition rewriting technique relying on hypertree decompositions, which replaces long rules with sets of smaller ones, with the aim of transforming the input program into an equivalent one that can possibly be evaluated more efficiently.
– Finally, we mention a series of techniques falling into the category of join optimizations, such as "pushing down selections" and other join rewritings; they have diverse aims, such as decreasing the number of matches considered during rule instantiation, detecting inconsistencies in the input program early, or syntactically rewriting the input program with the twofold intent of easing the instantiation and improving performance.

Query answering in I-DLV is empowered by the magic-sets technique [2]: when the input program features a query, the technique simulates a top-down computation by rewriting the input program so as to identify the subset of the instantiation that is sufficient for answering the query. The restriction of the instantiation is obtained by means of additional "magic" predicates, whose extensions represent the atoms that are relevant with respect to the query; a simplified sketch of the rewriting is shown below.
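As an illustration, consider the classic reachability program queried for the nodes reachable from a constant a. A magic-set rewriting in the spirit of [2] restricts grounding to nodes actually reachable from a; this is a simplified textbook-style sketch, not necessarily the exact rewriting produced by I-DLV:

  % reachability program; the query asks for the nodes reachable from a
  path(X,Y) :- edge(X,Y).
  path(X,Y) :- edge(X,Z), path(Z,Y).
  path(a,Y)?

  % magic-set rewriting for the binding pattern (bound, free)
  magic_path(a).                              % seed: the constant bound by the query
  magic_path(Z) :- magic_path(X), edge(X,Z).  % propagate the binding through the recursion
  path(X,Y) :- magic_path(X), edge(X,Y).      % original rules, guarded by magic atoms
  path(X,Y) :- magic_path(X), edge(X,Z), path(Z,Y).

Only atoms path(X,Y) with X reachable from a are ever grounded, which can shrink the instantiation dramatically on large graphs.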
Customizability. I-DLV provides fine-grained control over the whole computational process, allowing the user to enable or disable each of the many optimization techniques both via command-line options and via inline annotations. More in detail, I-DLV programs can be enriched with global and local annotations (i.e., on a per-rule basis) for customizing machineries such as body ordering and indexing. For instance, the indexing schema of a specific atom in a rule can be constrained to satisfy some specific conditions by annotating the rule as follows:

  %@rule_atom_indexed(@atom=a(X,Y,Z), @arguments={0,2}).

When instantiating the annotated rule, the atom a(X,Y,Z) will be indexed, if possible, with a double index on the first and third arguments.

Since its release, I-DLV has proven its reliability and efficiency both as an ASP grounder and as a deductive database engine. In the latest ASP Competition [14], I-DLV ranked first when combined with an automatic solver selector [12], which inductively chooses the best solver depending on inherent features of the produced instantiation, and second when combined with the state-of-the-art solver clasp [13]. Moreover, I-DLV performance results are promising also as a deductive database system [7]. The system has been tested on the query-based set of problems from OpenRuleBench [20], an open set of resources featuring a suite of benchmarks for analyzing the performance and scalability of different rule engines, and compared with the former DLV version and with XSB [24], which was among the clear winners of the official OpenRuleBench runs [20] and is currently one of the most widespread logic programming and deductive database systems. The results show that I-DLV not only behaves better than DLV, but is also definitely competitive against XSB. For a detailed description of these experiments we refer the reader to [7].

2.2 Interoperability Features

In this section we briefly illustrate the mechanisms and tools of I-DLV for (i) interoperability with external systems, (ii) accommodation of external sources of computation, and (iii) value invention/modification within logic programs. In particular, I-DLV supports direct connection to relational databases and SPARQL-enabled ontologies via explicit import/export directives, and access to external data via calls to Python scripts with external atoms.

RDBMS Data Access. I-DLV can import relations from an RDBMS by means of an #import_sql directive. For instance,

  #import_sql(DB, "user", "pass", "SELECT * FROM t", p)

accesses database DB and imports all tuples of table t as facts with predicate name p. Similarly, #export_sql directives are used to populate specific tables with the extension of a predicate; a small round-trip sketch follows.
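As a usage sketch, the following program imports an edge relation, computes its transitive closure (a typical recursive query), and writes the result back. The database name, credentials, and table names are illustrative, and the exact #export_sql argument layout is an assumption on our part; the online documentation [10] should be consulted for the precise syntax:

  % import: each row of edge_table becomes a fact edge/2
  #import_sql(CompanyDB, "user", "pass", "SELECT src, dst FROM edge_table", edge)

  % deductive part: transitive closure of the imported relation
  reach(X,Y) :- edge(X,Y).
  reach(X,Y) :- edge(X,Z), reach(Z,Y).

  % export: materialize the extension of reach/2 into reach_table
  % (argument layout assumed; see the documentation [10])
  #export_sql(CompanyDB, "user", "pass", reach, 2, reach_table)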
Ontology-Based Data Access. Data can also be imported from local RDF/XML files and from remote endpoints via SPARQL queries, by means of directives of the form:

  #import_local_sparql("file", "query", pred_name, pred_arity [, types])
  #import_remote_sparql("endpoint_url", "query", pred_name, pred_arity [, types])

where query is a SPARQL statement defining the data to be imported, and the optional types argument specifies the conversion for mapping data types to ASP-Core-2 terms. For the local import, file can be either a local or a remote URL pointing to an RDF/XML file: in the latter case, the file is downloaded and treated as a local RDF/XML file; in any case, the ontology graph is built in memory. As for the remote import, endpoint_url refers to a remote endpoint, and building the graph is up to the remote server; this second option might be very convenient in case of large datasets.
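For instance, assuming a hypothetical RDF/XML file people.rdf using the FOAF vocabulary, a directive such as the following would populate a binary predicate person_name (file name, query, and predicate name are illustrative):

  #import_local_sparql("people.rdf", "PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?p ?n WHERE { ?p foaf:name ?n }", person_name, 2)

Each result row of the SELECT query becomes a fact person_name(p, n), which can then be freely joined with data imported via #import_sql within the same program.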
Generic Data Access via Python scripts. Input programs can be enriched with external atoms of the form &p(i1, ..., in; o1, ..., om), where p is the name of a Python function, and i1, ..., in and o1, ..., om (n, m ≥ 0) are input and output terms, respectively. For each instantiation i'1, ..., i'n of the input terms, function p is called with arguments i'1, ..., i'n and returns a set of instantiations for o1, ..., om. For instance, a single line of Python:

  def rev(s): return s[::-1]

suffices to define a function rev that reverses strings, and which can be used within a rule of the following form:

  revWord(Y) :- word(X), &rev(X;Y).

External atoms give the user a powerful tool for significantly extending interoperability, granting access to virtually unlimited external data sources. Hence, additional import/export features for specific semistructured or unstructured data sources can be externally defined by suitable Python scripts; a data-cleaning sketch in this style is given below. Obviously, "native" support for interoperability should be preferred whenever available: it is intuitive that native support allows much better performance, and the experiments reported in [6] give an idea of the effective performance gain obtainable with the native SQL/SPARQL local import directives against the same directives implemented via Python scripts.
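As an illustration of ETL-style cleaning via external atoms, a hypothetical Python function can normalize raw string values (function and predicate names below are our own, for illustration only):

  # Python: trim surrounding whitespace and lowercase a string value
  def normalize(s):
      return s.strip().lower()

and, on the ASP side, a rule can use it to produce cleaned facts:

  % cleaned_customer/2 pairs each id with the normalized name
  cleaned_customer(Id, CleanName) :- raw_customer(Id, Name), &normalize(Name; CleanName).

In this way a cleaning step that would traditionally live in an external ETL workflow is expressed declaratively, inside the same program that performs the integration.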
Value invention/modification. The availability of both external atoms and function symbols, the latter included in the ASP-Core-2 compliance, makes it possible to address very interesting issues from a database perspective. First of all, it is well known that function symbols allow one to implement value invention by skolemization; this turns out to be a very useful feature when dealing with ontologies. Moreover, the generality of external atoms allows data modification processes, typical of ETL workflows, to be included in logic rules. In [26] we already described how external atoms may help data cleaning processes in a logic-based scenario. A minimal skolemization example is sketched below.
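For instance, an ontological axiom stating that every employee works in some (possibly unknown) department introduces an existentially quantified value; with function symbols this can be simulated by a Skolem term. The encoding below is a standard minimal sketch with illustrative predicate names:

  % dept(X) is a Skolem term acting as the invented department of employee X
  works_in(X, dept(X)) :- employee(X).

The Skolem term dept(X) serves as a placeholder for the unknown department of each X, so value invention is obtained without existential quantification in rule heads.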
3 Application of I-DLV Features for Data Integration

The adoption of deductive database technology for data integration solutions is not new [16, 17, 19, 23, 25]; however, the recent developments in ASP described in this paper allow a more concrete application of deductive systems in real-world applications requiring the integration of heterogeneous data such as RDBMSs, ontologies, and semi-structured information sources. A general architecture for a modern integration system based on I-DLV is presented in Figure 1, where both the main architectural elements and the specific I-DLV functionalities oriented to data integration are highlighted; these are described next, layer by layer.

[Fig. 1. Architecture of a data integration system based on I-DLV: a Data Layer (RDB, ontology, and semistructured data sources), a Schema Layer (source schema, global schema, mapping, constraints), an Evaluation Layer (DLV with I-DLV, comprising an Optimizer with magic sets, body orderings, indexing strategies, join optimizations, and hypertree decompositions; a Data Processor with skolemizer, rewriter, and ETL facilities accessed via SQL, SPARQL, and Python; and an external CQA rewriter), and a Presentation Layer (query and output).]

The Data Layer, which comprises the set of input information sources, can handle several kinds of data: (i) standard databases can be directly accessed through the #import_sql directives included in I-DLV; (ii) graph databases, RDF ontologies, and more generally SPARQL-enabled ontologies can be accessed via the #import_local_sparql and #import_remote_sparql directives; (iii) interoperability with any other kind of input format can be granted by external atoms relying on suitable Python scripts.

The Schema Layer includes everything that describes the data integration context from a conceptual point of view, namely source and global schemas, mappings, and constraints, in a way similar to what has been widely studied in the literature [16]. Support for this design phase could be provided by already available external graphical tools, such as the one presented in [11].

The Evaluation Layer includes everything needed to transform input data, schemas, and queries into answers in an effective way. The core role is played by I-DLV which, as previously pointed out, has been incorporated as the grounding module of the DLV system. Here we concentrate our attention on three main logical portions: the Data Processor, the Optimizer, and the CQA Rewriter.

The Data Processor highlights some of the advanced functionalities included in I-DLV; in more detail, the general capabilities of Python-based external atoms make it possible to include ETL processes inside the ASP engine. This is a particularly interesting innovation, since reasoning on deductive databases has usually excluded ETL processes, which were confined to external workflows. Moreover, the ASP-Core-2 compliance of the I-DLV language implies the possibility to exploit function symbols as predicate arguments; in a database-oriented setting, this allows skolemization to be easily simulated. This is a particularly interesting feature when ontologies are among the inputs; in fact, it is well known that, in particular cases, value invention in ontologies can be handled via skolemization. This opportunity significantly expands the data integration potential of the system with respect to existing proposals. It is worth observing that, in a parallel project involving DLV, a more general extension of ASP supporting existentially quantified rule heads, and consequently more complex axioms in ontologies, named DLV∃, has been proposed [18]; however, this language extension and the corresponding evaluation engine are not yet included in I-DLV.

The Optimizer applies to the resulting ASP program all the database-oriented optimizations previously outlined and included in I-DLV. In more detail, magic sets, join optimizations, hypertree decompositions, body orderings, and indexing strategies may altogether provide a crucial speedup in query answering processes, thus allowing the adoption of the system in real application scenarios.

To complete the picture of the Evaluation Layer, it is worth observing that, if the global schema is equipped with constraints that must be satisfied during data integration, Consistent Query Answering techniques and optimizations such as the ones presented in [21, 22] can be applied. In Figure 1 this is represented as a functionality external to DLV, since it is not included inside the engine yet; however, it would be straightforward to incorporate such techniques into the system, since they are based on rewritings of ASP programs.

Finally, the Presentation Layer is devoted to allowing users to compose queries and obtain the corresponding results. Again, available external graphical tools [11] can support this phase.

4 Future Work

In this paper we briefly reported on the most recent database-oriented advancements in the deductive system I-DLV, and we showed their application to a data integration setting. The reported features clearly show that data integration is still a very active and promising research area, kept strongly alive by new challenges arising from ontologies and from semi-structured and unstructured information sources. Given the positive results in terms of efficiency and extensibility obtained for the I-DLV system, we first plan to incorporate into I-DLV the features already developed in parallel projects, such as the CQA rewriting and optimization techniques and the support for existential rules introduced in DLV∃ for ontology querying. Moreover, we plan to explicitly implement connectors for different data formats. As a matter of fact, reasoning on top of big data is also part of ongoing projects in the research group.

References

1. Alviano, M., Calimeri, F., Dodaro, C., Fuscà, D., Leone, N., Perri, S., Ricca, F., Veltri, P., Zangari, J.: The ASP system DLV2. In: LPNMR. Lecture Notes in Computer Science, vol. 10377, pp. 215–221. Springer (2017)
2. Alviano, M., Faber, W., Greco, G., Leone, N.: Magic sets for disjunctive datalog programs. Artif. Intell. 187, 156–192 (2012)
3. Arenas, M., Bertossi, L.E., Chomicki, J.: Specifying and querying database repairs using logic programs with exceptions. In: Larsen, H.L., Kacprzyk, J., Zadrozny, S., Andreasen, T., Christiansen, H. (eds.) Proceedings of the Fourth International Conference on Flexible Query Answering Systems (FQAS 2000) (2000)
4. Calì, A., Lembo, D., Rosati, R.: Query rewriting and answering under constraints in data integration systems. In: IJCAI. pp. 16–21. Morgan Kaufmann (2003)
5. Calimeri, F., Faber, W., Gebser, M., Ianni, G., Kaminski, R., Krennwallner, T., Leone, N., Ricca, F., Schaub, T.: ASP-Core-2: Input language format. ASP Standardization Working Group, Tech. Rep. (2012)
6. Calimeri, F., Fuscà, D., Perri, S., Zangari, J.: External computations and interoperability in the new DLV grounder. In: AI*IA. Lecture Notes in Computer Science, vol. 10640, pp. 172–185. Springer (2017)
7. Calimeri, F., Fuscà, D., Perri, S., Zangari, J.: I-DLV: the new intelligent grounder of DLV. Intelligenza Artificiale 11(1), 5–20 (2017). https://doi.org/10.3233/IA-170104
8. Calimeri, F., Fuscà, D., Perri, S., Zangari, J.: Optimizing answer set computation via heuristic-based decomposition. In: PADL. Lecture Notes in Computer Science, vol. 10702, pp. 135–151. Springer (2018)
9. Calimeri, F., Gebser, M., Maratea, M., Ricca, F.: Design and results of the fifth answer set programming competition. Artif. Intell. 231, 151–181 (2016)
10. Calimeri, F., Perri, S., Fuscà, D., Zangari, J.: I-DLV homepage (since 2016), https://github.com/DeMaCS-UNICAL/I-DLV/wiki
11. Febbraro, O., Grasso, G., Leone, N., Reale, K., Ricca, F.: Datalog development tools (extended abstract). In: Datalog. Lecture Notes in Computer Science, vol. 7494, pp. 81–85. Springer (2012)
12. Fuscà, D., Calimeri, F., Zangari, J., Perri, S.: I-DLV+MS: preliminary report on an automatic ASP solver selector. In: RCRA@AI*IA. CEUR Workshop Proceedings, vol. 2011, pp. 26–32. CEUR-WS.org (2017)
13. Gebser, M., Kaminski, R., Kaufmann, B., Romero, J., Schaub, T.: Progress in clasp series 3. In: LPNMR. Lecture Notes in Computer Science, vol. 9345, pp. 368–383. Springer (2015)
14. Gebser, M., Maratea, M., Ricca, F.: The sixth answer set programming competition. J. Artif. Intell. Res. 60, 41–95 (2017)
15. Lembo, D., Lenzerini, M., Rosati, R., Ruzzi, M., Savo, D.F.: Inconsistency-tolerant query answering in ontology-based data access. J. Web Sem. 33, 3–29 (2015)
16. Lenzerini, M.: Data integration: A theoretical perspective. In: PODS. pp. 233–246. ACM (2002)
17. Leone, N., Gottlob, G., Rosati, R., Eiter, T., Faber, W., Fink, M., Greco, G., Ianni, G., Kalka, E., Lembo, D., Lenzerini, M., Lio, V., Nowicki, B., Ruzzi, M., Staniszkis, W., Terracina, G.: The INFOMIX system for advanced integration of incomplete and inconsistent data. In: Proceedings of the 24th ACM SIGMOD International Conference on Management of Data (SIGMOD 2005). pp. 915–917. ACM Press, Baltimore, Maryland, USA (Jun 2005)
18. Leone, N., Manna, M., Terracina, G., Veltri, P.: Efficiently computable datalog∃ programs. In: KR. AAAI Press (2012)
19. Leone, N., Ricca, F., Rubino, L.A., Terracina, G.: Efficient application of answer set programming for advanced data integration. In: PADL. Lecture Notes in Computer Science, vol. 5937, pp. 10–24. Springer (2010)
20. Liang, S., Fodor, P., Wan, H., Kifer, M.: OpenRuleBench: An analysis of the performance of rule engines. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009. pp. 601–610. ACM (2009). https://doi.org/10.1145/1526709.1526790
21. Manna, M., Ricca, F., Terracina, G.: Consistent query answering via ASP from different perspectives: Theory and practice. Theory and Practice of Logic Programming 13(2), 227–252 (2013)
22. Manna, M., Ricca, F., Terracina, G.: Taming primary key violations to query large inconsistent data via ASP. Theory and Practice of Logic Programming 15(4-5), 696–710 (2015). https://doi.org/10.1017/S1471068415000320
23. Nardi, B., Reale, K., Ricca, F., Terracina, G.: An integrated environment for reasoning over ontologies via logic programming. In: RR. Lecture Notes in Computer Science, vol. 7994, pp. 253–258. Springer (2013)
24. Swift, T., Warren, D.S.: XSB: Extending Prolog with tabled logic programming. Theory and Practice of Logic Programming 12(1-2), 157–187 (2012). https://doi.org/10.1017/S1471068411000500
25. Terracina, G., Leone, N., Lio, V., Panetta, C.: Experimenting with recursive queries in database and logic programming systems. Theory and Practice of Logic Programming 8, 129–165 (2008)
26. Terracina, G., Martello, A., Leone, N.: Logic-based techniques for data cleaning: An application to the Italian national healthcare system. In: LPNMR. Lecture Notes in Computer Science, vol. 8148, pp. 524–529. Springer (2013)
27. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press (1988)