=Paper= {{Paper |id=Vol-2322/BigVis_9 |storemode=property |title=VESEL: VisuaL Exploration of Schema Evolution using Provenance Queries |pdfUrl=https://ceur-ws.org/Vol-2322/BigVis_9.pdf |volume=Vol-2322 |authors=Christos Athinaiou,Haridimos Kondylakis |dblpUrl=https://dblp.org/rec/conf/edbt/AthinaiouK19 }} ==VESEL: VisuaL Exploration of Schema Evolution using Provenance Queries== https://ceur-ws.org/Vol-2322/BigVis_9.pdf
          VESEL: VisuaL Exploration of Schema Evolution Using
                          Provenance Queries

                         Christos Athinaiou                                                   Haridimos Kondylakis
                    Computer Science Department,                                            Institute of Computer Science,
                        University of Crete,                                                            FORTH,
                         Heraklion, Greece                                                         Heraklion, Greece
                        csd3258@csd.uoc.gr                                                      kondylak@ics.forth.gr

ABSTRACT                                                                          Independent of the database evolution language used, com-
Database schemata are not static artifacts but subject to continu-             plex evolution scenarios can easily lead to hundreds of schema
ous change as new requirements daily occur and the modeling                    modification operations, e.g. the evolution of Wikipedia’s schema
choices of the past should be updated or adapted. To this direc-               through 171 schema versions [12], making it extremely difficult
tion, multiple approaches available already try to keep multiple               for developers to identify the modeling changes of the past and
co-existing schema versions in parallel or model schema evolu-                 to be able to observe specific decision paths over this complex
tion through Schema Modification Operations, known as SMOs.                    evolution.
However, to the best of our knowledge, in the era of big data,                    In this demonstration, we present the VisuaL Exploration of
where thousands of SMOs might appear, it is really hard for devel-             Schema Evolution (VESEL) system, enabling developers to issue
opers to identify the modeling choices of the past and to explore              provenance queries, in order to actively explore the evolution of
how a specific column or table has been evolved. In this demo, we              the schema. The database evolution language on top of which
present VESEL, the first system enabling the VisuaL Exploration                our system works is BiDEL, but the modular nature of the system
of Schema Evolution using provenance queries. Our approach                     makes it possible to be extended with other languages in the near
relies on a state of the art database evolution language, and can              future. It supports how (which schema modification operation),
efficiently answer queries about when a specific table or column               when (which version) and why (sequence of schema modification
has been introduced, how - with which SMO operation and why                    operation) provenance queries for a specific table or column and
- which is the sequence of changes that led to the creation of                 enables the graphical visualization of the results. As the selected
the specific table/column. In the demonstration we will present                database evolution language guarantees bidirectionality, queries
the architecture of our system and the various algorithms imple-               can be issued using a recent or a past schema version and we can
mented, enabling end-users to visually explore schema evolution.               track the evolution of the specific table/column through multiple
Then we will allow conference participants to interact directly                schema versions.
with the system to test its capabilities.                                         To the best of our knowledge, our approach is the only visual
                                                                               approach enabling active exploration of schema evolution. A
                                                                               preliminary work to the same direction is [13]. However it lacks a
1    INTRODUCTION                                                              visual exploration interface, the corresponding algorithms are not
As recent databases do not proactively support schema evolu-                   presented, and it uses PRISM/PRISM++ SMOs not guaranteeing
tion, developers have to migrate data from one database version                backward query rewriting and data migration. A similar approach
to another using error-prone and complex Extract-Transform-                    but for ontologies has already shown the benefits of such an
Load scripts [1–3]. The problem has been recognized by multi-                  approach [14, 15].
ple research works that try to offer tools for the uninterrupted                  The remaining of this paper is structured as follows: In section
evolution of either semantic [4] and relational schemata. More                 2 we describe system’s architecture, the various components used
specifically for the relational world, the Model Management ap-                and the implemented algorithms. Then, in Section 3 we present
proach provides tools to match, diff, merge and extract mappings               an outline of our demonstration. Finally, Section 5 concludes this
between schema versions [5]. PRISM and PRISM++ [6, 7] allow                    paper and presents directions for future work.
the developers to specify the evolution steps using Schema Mod-
ification Operations (SMOs). Those SMOs are then implemented                   2     ARCHITECTURE
for data migration and for rewriting past queries to work in the               VESEL has been implemented using python and flask micro-
new versions. PRIMA [8] proposes SMOs that enable forward but                  framework [16], whereas for the visualization the Cytoscape
not backward query rewriting and data migration, CoDEL [9]                     library [17] has been used. VESEL employees a three layer archi-
presents relationally complete SMOs whereas Symmetric Re-                      tecture, shown in Figure 1.
lational Lenses [10] define bidirectional SMOs not relationally                   On the top layer, we find the graphical user interface (GUI), in
complete. Finally, BiDEL [11] presents SMOs that are relationally              the middle the query engine and at the bottom the persistence
complete, inversible and enable forward and backward query                     layer. In the sequel we introduce the various components used
rewriting and data migration, whereas they can guarantee bidi-                 and we describe their functionality.
rectionality.
                                                                               2.1    Graphical User Interface
© 2019 Copyright held by the owner/author(s).                                  The whole process is started as soon as a file with BiDEL SMOs
Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference
(March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.                             is uploaded to the system or an existing one is selected. The
                                                                               BiDEL database evolution language includes SMOs for creating a
                                                                        textual overview, enabling the user to actively move between
                                                                        versions and SMOs.
                                                                           In the example, shown in Figure 2, we demonstrate the evo-
                                                                        lution of the "revision" table from the wikimedia dataset. The
                                                                        "revision" table has been the result of merging two tables, i.e.
                                                                        the "revisiona" and the "revision_old" table. Before that the table
                                                                        "revisiona" had been created out of the decomposision of the
                                                                        table "revisionb" into "revisiona" and "text". On the other hand
                                                                        the table "revision_old" was the result of a decomposition of the
                                                                        table "old". The graph continues until the tables "cur" and "old"
                                                                        are created for the first time.

            Figure 1: High-level system architecture.                   2.2    Query Engine
                                                                        The query engine receives requests by the GUI and subsequently
                                                                        answers the issued queries. To answer queries about the evolution
table with specific fields, dropping and renaming tables, adding
                                                                        of a specific table/column we define first the notion of an affecting
and updating columns and merging and decomposing tables.
                                                                        schema modification operation.
The complete list of the SMOs available are depicted in Table 1.
We have to note that such files can be manually generated or               Definition 2.1. Let t be a table/column of a schema version m,
generated by tools like the InVerDa1 [18]. For more information         namely the Sm , and ∆Sk ,Sm , (k < m) be the sequence of schema
on the SMOs and the corresponding rules that add and delete             modification operations between Sk and Sm . In addition, let δ a (s)
specific schema information, the interested reader is forwarded         be the new schema information introduced with the SMO s and
to [11].                                                                δd (s) the schema information that is being deleted using s. An
                                                                        SMO s ∈ ∆Sk ,Sm affects the table/column t, denoted by a f (t), if
Table 1: The SMOs of the BiDEL schema evolution lan-                    t ∈ Sm and t ∈ δ a (s) .
guage.                                                                     An affecting schema modification operation captures the way
                                                                        a table/column was introduced between two schema versions.
         EVOLUTION START [FROM nameOld]                                 The intuition behind this is that, as we usually use the latest
         EVOLUTION COMMIT AS nameNew;                                   schema version, we base our exploration on this version and we
         CREATE TABLE R(c 1 ,....,c n )                                 would like to track how a specific table/field has been introduced.
         DROP TABLE R
         RENAME TABLE R INTO R ′                                           2.2.1   Answering how provenance queries.
         ADD COLUMN a AS f (r 1 ,...r n ) INTO Ri
         DROP COLUMN r FROM Ri DEFAULT f (r 1 ,...,r n )                   Assuming that we have the ∆Sk ,Sm available, it is trivial to
         RENAME COLUMN r IN Ri TO r ′                                   identify the affecting SMO by just scanning the delta log once.
         MERGE TABLE R(c R ), S(c S ) INTO T                            This affecting SMO will be the answer to how provenance queries.
         DECOMPOSE TABLE R INTO S(s 1 ,...,sn )                         As BiDEL guarantees that neither the rules for a single SMO, nor
          [,T(t1 ,...,tn ) on (PK|FK fk|cond) ]                         the version genealogy have cycles, the affecting SMO if it exists
                                                                        is unique.
   As soon as the file is uploaded, it is parsed and stored to the
internal database. Instead of the database, the uploaded file could        2.2.2   Answering when provenance queries.
also be directly parsed, but for reasons of efficiency we selected to
store the SMOs in a database. Then the system is able to answer            Next, for answering when provenance queries we are inter-
provenance queries. The queries available to the end-users are          ested in identifying the version at which the affecting SMO ap-
the following:                                                          peared. This is again trivial as for each evolution step, we have
                                                                        the complete list of SMOs and we can easily identify the specific
    (1) How. Which was the operation that introduced a specific
                                                                        evolution step that included the specific affecting SMO.
        table or a specific column?
    (2) When. At which version the specific table/column was
                                                                           2.2.3   Answering why provenance queries.
        introduced?
    (3) Why. Which was the sequence of operations that led to
                                                                           Presenting only answers to when and how provenance queries
        the introduction of a specific table/column?
                                                                        is not enough when complex evolution has occurred and we
   When the answers are returned, both a graphical and a textual        would like to see the sequence of events that led to the creation
overview of the answer are presented to the user. A screenshot of       of a specific table or column. This sequence can be identified by
the system is shown in Figure 2. Each node in the graph represents      answering why provenance queries. To be able to answer such
an SMO in its short form. When the mouse hovers on top of a             queries we first define the change sequence of a specific SMO.
node, it additional information is presented as a tooltip, whereas
the whole list with the retrieved SMOs is shown on the right panel.        Definition 2.2. A change sequence for a schema modification
A user can click on a specific SMO and refine the exploration           operation s ∈ ∆Sk ,Sm , denoted by CS s , is the minimal sequence
based on this specific SMO. This will redraw the canvas and the         of schema modification operations in ∆Sk ,Sm such that s ∈ CS s
                                                                        and that when CS s is applied to version k we get the fields/tables
1 https://db47122.inf.tu-dresden.de/
                                                                        participating in s.
                     Figure 2: Showing the evolution of the "revision" table from the wikimedia dataset .


   As in essence the whole ∆Sk ,Sm is a set of changes that lead to    As such we argue that exploring the evolution of big schemata
the fields/tables participating in s, we are looking for the minimal   with multiple changes is feasible and efficient.
sequence of SMOs, in the sense that one cannot remove any of              It is quite easy to adapt the aforementioned algorithm for pre-
the SMOs in it, and still when applied to the version k you get        senting the change sequence for a specific table/column instead
the fields/tables participating in s. As such, a change sequence is    of a specific SMO. The idea is first to look for the affecting SMO
providing the minimal information in order to understand how           and then to use Algorithm 1 for identifying the corresponding
the specific table or column has been evolved to reach version m.      change sequence.
   Similarly to an SMO , δ a (CS) is the new schema information           The aforementioned algorithm is not only applicable for vi-
introduced when CS is being applied to Sk and δd (CS) the schema       sualizing the change sequence of a table/column of the latest
information that is being deleted from Sk using CS.                    schema version, but can be also used to visualize the change se-
   It can be easily proved that a change sequence for the BiDEL        quence of a table/column from an arbitrary schema version of the
language if it exists, is unique.                                      past. In that case, instead of providing as input to our algorithms
   Next we present an algorithm that given a sequence of SMOs          the ∆Sk ,Sm we could provide the ∆Sk ,Sn where n < m.
capturing the ∆Sk ,Sm , it produces the change sequence for an
SMO s. The algorithm is shown in Algorithm 1. The idea is                  2.2.4   Exploiting bidirectionality and inversibility.
that the algorithm starts by an input SMO and identifies the
tables/columns that are also added to the schema, possibly by             In addition, since BiDEL guarantees bidirectionality and in-
deleting other schema parts. Then it searches for the SMOs that        versibility (e.g. the inverse of the merge SMO is the decompose
delete the added information in order to add new information           SMO), we could get the evolution steps for a table/column of a
and so on.                                                             past schema version till we reach the most recent version by pro-
                                                                       viding as input the ∆Sm ,Sk to our algorithms. This can be directly
Algorithm 1 Compute Change Sequence                                    constructed from the ∆Sk ,Sm using the inverse SMOs of those in
Input: s, ∆Sk ,Sm                                                      the ∆Sk ,Sn . Although bidirectionality and inversibility are not
Output: CS s                                                           properties of all database evolution languages, if available, they
 1: procedure ComputeChangeSeq(s, ∆S k ,Sm )
                                                                       contribute to a richer exploration experience on the available
 2:    CS s ← []                                                       schema versions.
 3:    CS s .push(s)
 4:    for each s ′ ∈ ∆Sk ,Sm do
                                                                       3     DEMONSTRATION OUTLINE
 5:        if δ a (s ′ ) ∈ δd (CS s ) then
 6:             CS s .push(s ′ )                                       For demonstrating the functionality of the system, the wikimedia
       return CS s                                                     dataset will be used. The dataset contains more than 170 schema
                                                                       versions with the corresponding SMOs and offers a rich real
                                                                       dataset for experimentation. The demonstration will proceed in
   The algorithm is immediately proved by construction and has
a complexity of O(N ) where N is the number of SMOs in ∆Sk ,Sm .
                                                                       four phases.
   (1) Exploring wikimedia’s schema and raw evolution log. The                        [2] H. Kondylakis and D. Plexousakis, “Ontology evolution: Assisting query mi-
       demonstration will start by visualizing the latest version of                      gration,” in ER, vol. 7532 of Lecture Notes in Computer Science, pp. 331–344,
                                                                                          Springer, 2012.
       the schema and exploring the various tables and columns.
       Then the raw evolution log will be opened and the diffi-                       [3] H. Kondylakis and D. Plexousakis, “Exelixis: evolving ontology-based data
       culty to identify the actions that led to the generation of                        integration system,” in SIGMOD Conference, pp. 1283–1286, ACM, 2011.
       specific column/tables will be explained.
                                                                                      [4] H. Kondylakis and D. Plexousakis, “Ontology evolution without tears,” J. Web
   (2) System overview. Then an overview of the system will be                            Semant., vol. 19, pp. 42–58, 2013.
       presented along with the corresponding components. The
       ideas behind how, when and why provenance queries will                         [5] P. A. Bernstein and S. Melnik, “Model management 2.0: manipulating richer
                                                                                          mappings,” in Proceedings of the ACM SIGMOD International Conference on
       be explained and the algorithms will be briefly described                          Management of Data, Beijing, China, June 12-14, 2007, pp. 1–12, 2007.
       as well.
   (3) Small tutorial. Next the users will be informed about the                      [6] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo, “Automating the database
                                                                                          schema evolution process,” VLDB J., vol. 22, no. 1, pp. 73–98, 2013.
       functionalities provided through the graphical user inter-
       face and some examples will be executed demonstrating                          [7] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo, “Update rewriting and
       various how, when and why provenance queries. More                                 integrity constraint maintenance in a schema evolution support system:
       specifically, we will demonstrate how and when queries                             PRISM++,” PVLDB, vol. 4, no. 2, pp. 117–128, 2010.

       for selected tables and fields and then we will focus on an-                   [8] H. J. Moon, C. Curino, M. Ham, and C. Zaniolo, “PRIMA: archiving and query-
       swering why queries on tables that have been constructed                           ing historical data with evolving schemas,” in Proceedings of the ACM SIGMOD
       with multiple SMOs. This will enable users to better depict                        International Conference on Management of Data, SIGMOD 2009, Providence,
                                                                                          Rhode Island, USA, June 29 - July 2, 2009, pp. 1019–1022, 2009.
       the tool’s usage and usefulness.
   (4) "Hands-on" Phase. In this phase the conference partic-                         [9] K. Herrmann, H. Voigt, A. Behrend, and W. Lehner, “Codel - A relationally
       ipants will be invited to directly interact with the sys-                          complete language for database evolution,” in Advances in Databases and
                                                                                          Information Systems - 19th East European Conference, ADBIS 2015, Poitiers,
       tem, having the chance to explore the evolution of ta-                             France, September 8-11, 2015, Proceedings, pp. 63–76, 2015.
       bles/columns through all versions available for the wiki-
       media dataset.                                                                [10] M. Hofmann, B. C. Pierce, and D. Wagner, “Symmetric lenses,” in Proceedings
                                                                                          of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming
   We also intend to release the system online before the con-                            Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, pp. 371–384, 2011.
ference, so that the conference participants to have a chance of
exploring system’s functionalities at their own pace and location.                   [11] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, and W. Lehner, “Living in
                                                                                          parallel realities: Co-existing schema versions with a bidirectional database
                                                                                          evolution language,” in Proceedings of the 2017 ACM International Conference
4    CONCLUSIONS                                                                          on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19,
                                                                                          2017, pp. 1101–1116, 2017.
In this demonstration we present a novel system enabling the
VisuaL Exploration of multiple Schema Versions (VESEL) using                         [12] C. Curino, H. J. Moon, L. Tanca, and C. Zaniolo, “Schema evolution in wikipedia
provenance queries. The system gets as input the change log                               - toward a web information system benchmark,” in ICEIS 2008 - Proceedings of
with the BiDEL SMO’s over multiple schema versions and is then                             the Tenth International Conference on Enterprise Information Systems, Volume
                                                                                           DISI, Barcelona, Spain, June 12-16, 2008, pp. 323–332, 2008.
able to visually answer queries about how a specific table/field
was introduced, in which schema version and which was the                            [13] S. Gao and C. Zaniolo, “Supporting database provenance under schema evolu-
sequence of operations that led to the creation of the specific                           tion,” in Advances in Conceptual Modeling - ER 2012 Workshops CMS, ECDM-
                                                                                          NoCoDA, MoDIC, MORE-BI, RIGiM, SeCoGIS, WISM, Florence, Italy, October
table/field.                                                                              15-18, 2012. Proceedings, pp. 67–77, 2012.
   As future work, several challenging issues need to be fur-
ther investigated. For example, we already see BiDEL as just a                       [14] H. Kondylakis and N. Papadakis, “Evordf: evolving the exploration of ontology
                                                                                          evolution,” Knowledge Eng. Review, vol. 33, p. e12, 2018.
language describing SMO’s, with nice features. It is worth investi-
gated the other proposed languages as well. This would obviously                     [15] H. Kondylakis, M. Despoina, G. Glykokokalos, E. Kalykakis, M. Karapiperakis,
lead to a redefinition of some of our algorithms. In addition we                          M. Lasithiotakis, J. Makridis, P. Moraitis, A. Panteri, M. Plevraki, A. Provi-
intend to combine our approach with statistical information on                            dakis, M. Skalidaki, A. Stefanos, M. Tampouratzis, E. Trivizakis, F. Zervakis,
                                                                                          E. Zervouraki, and N. Papadakis, “Evordf: A framework for exploring ontology
the evolution, further guiding the user understanding on the                              evolution,” in ESWC (Satellite Events), vol. 10577 of Lecture Notes in Computer
schema evolution through multiple versions. Finally, it would be                          Science, pp. 104–108, Springer, 2017.
interesting to to study the effects of visualizing schema evolu-
                                                                                     [16] M. Grinberg, Flask Web Development - Developing Web Applications with Python.
tion in collaborative/multi-user or distributed environments that                         O’Reilly, 2014.
usually are prone to schema conflicts.
                                                                                     [17] M. Franz, C. T. Lopes, G. Huck, Y. Dong, S. O. Sümer, and G. D. Bader, “Cy-
                                                                                          toscape.js: a graph theory library for visualisation and analysis,” Bioinformatics,
5    ACKNOWLEDGMENTS                                                                      vol. 32, no. 2, pp. 309–311, 2016.
Work presented in the paper is part of the BOUNCE project
(H2020 #777167) that has received funding from the European                          [18] K. Herrmann, H. Voigt, T. Seyschab, and W. Lehner, “Inverda - the liquid
                                                                                          database,” in Datenbanksysteme für Business, Technologie und Web (BTW 2017),
Union’s Horizon 2020 Research and Innovation Programme. Any                               17. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme"
opinions, results, conclusions, and recommendations expressed                             (DBIS), 6.-10. März 2017, Stuttgart, Germany, Proceedings, pp. 619–622, 2017.
in this material are those of the authors and do not necessarily
reflect the views of BOUNCE or the European Commission.

6    REFERENCES
 [1] H. Kondylakis and D. Plexousakis, “Ontology evolution in data integration:
     Query rewriting to the rescue,” in ER, vol. 6998 of Lecture Notes in Computer
     Science, pp. 393–401, Springer, 2011.