=Paper=
{{Paper
|id=Vol-2322/BigVis_9
|storemode=property
|title=VESEL: VisuaL Exploration of Schema Evolution using Provenance Queries
|pdfUrl=https://ceur-ws.org/Vol-2322/BigVis_9.pdf
|volume=Vol-2322
|authors=Christos Athinaiou,Haridimos Kondylakis
|dblpUrl=https://dblp.org/rec/conf/edbt/AthinaiouK19
}}
==VESEL: VisuaL Exploration of Schema Evolution using Provenance Queries==
VESEL: VisuaL Exploration of Schema Evolution Using Provenance Queries Christos Athinaiou Haridimos Kondylakis Computer Science Department, Institute of Computer Science, University of Crete, FORTH, Heraklion, Greece Heraklion, Greece csd3258@csd.uoc.gr kondylak@ics.forth.gr ABSTRACT Independent of the database evolution language used, com- Database schemata are not static artifacts but subject to continu- plex evolution scenarios can easily lead to hundreds of schema ous change as new requirements daily occur and the modeling modification operations, e.g. the evolution of Wikipedia’s schema choices of the past should be updated or adapted. To this direc- through 171 schema versions [12], making it extremely difficult tion, multiple approaches available already try to keep multiple for developers to identify the modeling changes of the past and co-existing schema versions in parallel or model schema evolu- to be able to observe specific decision paths over this complex tion through Schema Modification Operations, known as SMOs. evolution. However, to the best of our knowledge, in the era of big data, In this demonstration, we present the VisuaL Exploration of where thousands of SMOs might appear, it is really hard for devel- Schema Evolution (VESEL) system, enabling developers to issue opers to identify the modeling choices of the past and to explore provenance queries, in order to actively explore the evolution of how a specific column or table has been evolved. In this demo, we the schema. The database evolution language on top of which present VESEL, the first system enabling the VisuaL Exploration our system works is BiDEL, but the modular nature of the system of Schema Evolution using provenance queries. Our approach makes it possible to be extended with other languages in the near relies on a state of the art database evolution language, and can future. It supports how (which schema modification operation), efficiently answer queries about when a specific table or column when (which version) and why (sequence of schema modification has been introduced, how - with which SMO operation and why operation) provenance queries for a specific table or column and - which is the sequence of changes that led to the creation of enables the graphical visualization of the results. As the selected the specific table/column. In the demonstration we will present database evolution language guarantees bidirectionality, queries the architecture of our system and the various algorithms imple- can be issued using a recent or a past schema version and we can mented, enabling end-users to visually explore schema evolution. track the evolution of the specific table/column through multiple Then we will allow conference participants to interact directly schema versions. with the system to test its capabilities. To the best of our knowledge, our approach is the only visual approach enabling active exploration of schema evolution. A preliminary work to the same direction is [13]. However it lacks a 1 INTRODUCTION visual exploration interface, the corresponding algorithms are not As recent databases do not proactively support schema evolu- presented, and it uses PRISM/PRISM++ SMOs not guaranteeing tion, developers have to migrate data from one database version backward query rewriting and data migration. A similar approach to another using error-prone and complex Extract-Transform- but for ontologies has already shown the benefits of such an Load scripts [1–3]. The problem has been recognized by multi- approach [14, 15]. ple research works that try to offer tools for the uninterrupted The remaining of this paper is structured as follows: In section evolution of either semantic [4] and relational schemata. More 2 we describe system’s architecture, the various components used specifically for the relational world, the Model Management ap- and the implemented algorithms. Then, in Section 3 we present proach provides tools to match, diff, merge and extract mappings an outline of our demonstration. Finally, Section 5 concludes this between schema versions [5]. PRISM and PRISM++ [6, 7] allow paper and presents directions for future work. the developers to specify the evolution steps using Schema Mod- ification Operations (SMOs). Those SMOs are then implemented 2 ARCHITECTURE for data migration and for rewriting past queries to work in the VESEL has been implemented using python and flask micro- new versions. PRIMA [8] proposes SMOs that enable forward but framework [16], whereas for the visualization the Cytoscape not backward query rewriting and data migration, CoDEL [9] library [17] has been used. VESEL employees a three layer archi- presents relationally complete SMOs whereas Symmetric Re- tecture, shown in Figure 1. lational Lenses [10] define bidirectional SMOs not relationally On the top layer, we find the graphical user interface (GUI), in complete. Finally, BiDEL [11] presents SMOs that are relationally the middle the query engine and at the bottom the persistence complete, inversible and enable forward and backward query layer. In the sequel we introduce the various components used rewriting and data migration, whereas they can guarantee bidi- and we describe their functionality. rectionality. 2.1 Graphical User Interface © 2019 Copyright held by the owner/author(s). The whole process is started as soon as a file with BiDEL SMOs Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org. is uploaded to the system or an existing one is selected. The BiDEL database evolution language includes SMOs for creating a textual overview, enabling the user to actively move between versions and SMOs. In the example, shown in Figure 2, we demonstrate the evo- lution of the "revision" table from the wikimedia dataset. The "revision" table has been the result of merging two tables, i.e. the "revisiona" and the "revision_old" table. Before that the table "revisiona" had been created out of the decomposision of the table "revisionb" into "revisiona" and "text". On the other hand the table "revision_old" was the result of a decomposition of the table "old". The graph continues until the tables "cur" and "old" are created for the first time. Figure 1: High-level system architecture. 2.2 Query Engine The query engine receives requests by the GUI and subsequently answers the issued queries. To answer queries about the evolution table with specific fields, dropping and renaming tables, adding of a specific table/column we define first the notion of an affecting and updating columns and merging and decomposing tables. schema modification operation. The complete list of the SMOs available are depicted in Table 1. We have to note that such files can be manually generated or Definition 2.1. Let t be a table/column of a schema version m, generated by tools like the InVerDa1 [18]. For more information namely the Sm , and ∆Sk ,Sm , (k < m) be the sequence of schema on the SMOs and the corresponding rules that add and delete modification operations between Sk and Sm . In addition, let δ a (s) specific schema information, the interested reader is forwarded be the new schema information introduced with the SMO s and to [11]. δd (s) the schema information that is being deleted using s. An SMO s ∈ ∆Sk ,Sm affects the table/column t, denoted by a f (t), if Table 1: The SMOs of the BiDEL schema evolution lan- t ∈ Sm and t ∈ δ a (s) . guage. An affecting schema modification operation captures the way a table/column was introduced between two schema versions. EVOLUTION START [FROM nameOld] The intuition behind this is that, as we usually use the latest EVOLUTION COMMIT AS nameNew; schema version, we base our exploration on this version and we CREATE TABLE R(c 1 ,....,c n ) would like to track how a specific table/field has been introduced. DROP TABLE R RENAME TABLE R INTO R ′ 2.2.1 Answering how provenance queries. ADD COLUMN a AS f (r 1 ,...r n ) INTO Ri DROP COLUMN r FROM Ri DEFAULT f (r 1 ,...,r n ) Assuming that we have the ∆Sk ,Sm available, it is trivial to RENAME COLUMN r IN Ri TO r ′ identify the affecting SMO by just scanning the delta log once. MERGE TABLE R(c R ), S(c S ) INTO T This affecting SMO will be the answer to how provenance queries. DECOMPOSE TABLE R INTO S(s 1 ,...,sn ) As BiDEL guarantees that neither the rules for a single SMO, nor [,T(t1 ,...,tn ) on (PK|FK fk|cond) ] the version genealogy have cycles, the affecting SMO if it exists is unique. As soon as the file is uploaded, it is parsed and stored to the internal database. Instead of the database, the uploaded file could 2.2.2 Answering when provenance queries. also be directly parsed, but for reasons of efficiency we selected to store the SMOs in a database. Then the system is able to answer Next, for answering when provenance queries we are inter- provenance queries. The queries available to the end-users are ested in identifying the version at which the affecting SMO ap- the following: peared. This is again trivial as for each evolution step, we have the complete list of SMOs and we can easily identify the specific (1) How. Which was the operation that introduced a specific evolution step that included the specific affecting SMO. table or a specific column? (2) When. At which version the specific table/column was 2.2.3 Answering why provenance queries. introduced? (3) Why. Which was the sequence of operations that led to Presenting only answers to when and how provenance queries the introduction of a specific table/column? is not enough when complex evolution has occurred and we When the answers are returned, both a graphical and a textual would like to see the sequence of events that led to the creation overview of the answer are presented to the user. A screenshot of of a specific table or column. This sequence can be identified by the system is shown in Figure 2. Each node in the graph represents answering why provenance queries. To be able to answer such an SMO in its short form. When the mouse hovers on top of a queries we first define the change sequence of a specific SMO. node, it additional information is presented as a tooltip, whereas the whole list with the retrieved SMOs is shown on the right panel. Definition 2.2. A change sequence for a schema modification A user can click on a specific SMO and refine the exploration operation s ∈ ∆Sk ,Sm , denoted by CS s , is the minimal sequence based on this specific SMO. This will redraw the canvas and the of schema modification operations in ∆Sk ,Sm such that s ∈ CS s and that when CS s is applied to version k we get the fields/tables 1 https://db47122.inf.tu-dresden.de/ participating in s. Figure 2: Showing the evolution of the "revision" table from the wikimedia dataset . As in essence the whole ∆Sk ,Sm is a set of changes that lead to As such we argue that exploring the evolution of big schemata the fields/tables participating in s, we are looking for the minimal with multiple changes is feasible and efficient. sequence of SMOs, in the sense that one cannot remove any of It is quite easy to adapt the aforementioned algorithm for pre- the SMOs in it, and still when applied to the version k you get senting the change sequence for a specific table/column instead the fields/tables participating in s. As such, a change sequence is of a specific SMO. The idea is first to look for the affecting SMO providing the minimal information in order to understand how and then to use Algorithm 1 for identifying the corresponding the specific table or column has been evolved to reach version m. change sequence. Similarly to an SMO , δ a (CS) is the new schema information The aforementioned algorithm is not only applicable for vi- introduced when CS is being applied to Sk and δd (CS) the schema sualizing the change sequence of a table/column of the latest information that is being deleted from Sk using CS. schema version, but can be also used to visualize the change se- It can be easily proved that a change sequence for the BiDEL quence of a table/column from an arbitrary schema version of the language if it exists, is unique. past. In that case, instead of providing as input to our algorithms Next we present an algorithm that given a sequence of SMOs the ∆Sk ,Sm we could provide the ∆Sk ,Sn where n < m. capturing the ∆Sk ,Sm , it produces the change sequence for an SMO s. The algorithm is shown in Algorithm 1. The idea is 2.2.4 Exploiting bidirectionality and inversibility. that the algorithm starts by an input SMO and identifies the tables/columns that are also added to the schema, possibly by In addition, since BiDEL guarantees bidirectionality and in- deleting other schema parts. Then it searches for the SMOs that versibility (e.g. the inverse of the merge SMO is the decompose delete the added information in order to add new information SMO), we could get the evolution steps for a table/column of a and so on. past schema version till we reach the most recent version by pro- viding as input the ∆Sm ,Sk to our algorithms. This can be directly Algorithm 1 Compute Change Sequence constructed from the ∆Sk ,Sm using the inverse SMOs of those in Input: s, ∆Sk ,Sm the ∆Sk ,Sn . Although bidirectionality and inversibility are not Output: CS s properties of all database evolution languages, if available, they 1: procedure ComputeChangeSeq(s, ∆S k ,Sm ) contribute to a richer exploration experience on the available 2: CS s ← [] schema versions. 3: CS s .push(s) 4: for each s ′ ∈ ∆Sk ,Sm do 3 DEMONSTRATION OUTLINE 5: if δ a (s ′ ) ∈ δd (CS s ) then 6: CS s .push(s ′ ) For demonstrating the functionality of the system, the wikimedia return CS s dataset will be used. The dataset contains more than 170 schema versions with the corresponding SMOs and offers a rich real dataset for experimentation. The demonstration will proceed in The algorithm is immediately proved by construction and has a complexity of O(N ) where N is the number of SMOs in ∆Sk ,Sm . four phases. (1) Exploring wikimedia’s schema and raw evolution log. The [2] H. Kondylakis and D. Plexousakis, “Ontology evolution: Assisting query mi- demonstration will start by visualizing the latest version of gration,” in ER, vol. 7532 of Lecture Notes in Computer Science, pp. 331–344, Springer, 2012. the schema and exploring the various tables and columns. Then the raw evolution log will be opened and the diffi- [3] H. Kondylakis and D. Plexousakis, “Exelixis: evolving ontology-based data culty to identify the actions that led to the generation of integration system,” in SIGMOD Conference, pp. 1283–1286, ACM, 2011. specific column/tables will be explained. [4] H. Kondylakis and D. Plexousakis, “Ontology evolution without tears,” J. Web (2) System overview. Then an overview of the system will be Semant., vol. 19, pp. 42–58, 2013. presented along with the corresponding components. The ideas behind how, when and why provenance queries will [5] P. A. Bernstein and S. Melnik, “Model management 2.0: manipulating richer mappings,” in Proceedings of the ACM SIGMOD International Conference on be explained and the algorithms will be briefly described Management of Data, Beijing, China, June 12-14, 2007, pp. 1–12, 2007. as well. (3) Small tutorial. Next the users will be informed about the [6] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo, “Automating the database schema evolution process,” VLDB J., vol. 22, no. 1, pp. 73–98, 2013. functionalities provided through the graphical user inter- face and some examples will be executed demonstrating [7] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo, “Update rewriting and various how, when and why provenance queries. More integrity constraint maintenance in a schema evolution support system: specifically, we will demonstrate how and when queries PRISM++,” PVLDB, vol. 4, no. 2, pp. 117–128, 2010. for selected tables and fields and then we will focus on an- [8] H. J. Moon, C. Curino, M. Ham, and C. Zaniolo, “PRIMA: archiving and query- swering why queries on tables that have been constructed ing historical data with evolving schemas,” in Proceedings of the ACM SIGMOD with multiple SMOs. This will enable users to better depict International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 1019–1022, 2009. the tool’s usage and usefulness. (4) "Hands-on" Phase. In this phase the conference partic- [9] K. Herrmann, H. Voigt, A. Behrend, and W. Lehner, “Codel - A relationally ipants will be invited to directly interact with the sys- complete language for database evolution,” in Advances in Databases and Information Systems - 19th East European Conference, ADBIS 2015, Poitiers, tem, having the chance to explore the evolution of ta- France, September 8-11, 2015, Proceedings, pp. 63–76, 2015. bles/columns through all versions available for the wiki- media dataset. [10] M. Hofmann, B. C. Pierce, and D. Wagner, “Symmetric lenses,” in Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming We also intend to release the system online before the con- Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, pp. 371–384, 2011. ference, so that the conference participants to have a chance of exploring system’s functionalities at their own pace and location. [11] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, and W. Lehner, “Living in parallel realities: Co-existing schema versions with a bidirectional database evolution language,” in Proceedings of the 2017 ACM International Conference 4 CONCLUSIONS on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pp. 1101–1116, 2017. In this demonstration we present a novel system enabling the VisuaL Exploration of multiple Schema Versions (VESEL) using [12] C. Curino, H. J. Moon, L. Tanca, and C. Zaniolo, “Schema evolution in wikipedia provenance queries. The system gets as input the change log - toward a web information system benchmark,” in ICEIS 2008 - Proceedings of with the BiDEL SMO’s over multiple schema versions and is then the Tenth International Conference on Enterprise Information Systems, Volume DISI, Barcelona, Spain, June 12-16, 2008, pp. 323–332, 2008. able to visually answer queries about how a specific table/field was introduced, in which schema version and which was the [13] S. Gao and C. Zaniolo, “Supporting database provenance under schema evolu- sequence of operations that led to the creation of the specific tion,” in Advances in Conceptual Modeling - ER 2012 Workshops CMS, ECDM- NoCoDA, MoDIC, MORE-BI, RIGiM, SeCoGIS, WISM, Florence, Italy, October table/field. 15-18, 2012. Proceedings, pp. 67–77, 2012. As future work, several challenging issues need to be fur- ther investigated. For example, we already see BiDEL as just a [14] H. Kondylakis and N. Papadakis, “Evordf: evolving the exploration of ontology evolution,” Knowledge Eng. Review, vol. 33, p. e12, 2018. language describing SMO’s, with nice features. It is worth investi- gated the other proposed languages as well. This would obviously [15] H. Kondylakis, M. Despoina, G. Glykokokalos, E. Kalykakis, M. Karapiperakis, lead to a redefinition of some of our algorithms. In addition we M. Lasithiotakis, J. Makridis, P. Moraitis, A. Panteri, M. Plevraki, A. Provi- intend to combine our approach with statistical information on dakis, M. Skalidaki, A. Stefanos, M. Tampouratzis, E. Trivizakis, F. Zervakis, E. Zervouraki, and N. Papadakis, “Evordf: A framework for exploring ontology the evolution, further guiding the user understanding on the evolution,” in ESWC (Satellite Events), vol. 10577 of Lecture Notes in Computer schema evolution through multiple versions. Finally, it would be Science, pp. 104–108, Springer, 2017. interesting to to study the effects of visualizing schema evolu- [16] M. Grinberg, Flask Web Development - Developing Web Applications with Python. tion in collaborative/multi-user or distributed environments that O’Reilly, 2014. usually are prone to schema conflicts. [17] M. Franz, C. T. Lopes, G. Huck, Y. Dong, S. O. Sümer, and G. D. Bader, “Cy- toscape.js: a graph theory library for visualisation and analysis,” Bioinformatics, 5 ACKNOWLEDGMENTS vol. 32, no. 2, pp. 309–311, 2016. Work presented in the paper is part of the BOUNCE project (H2020 #777167) that has received funding from the European [18] K. Herrmann, H. Voigt, T. Seyschab, and W. Lehner, “Inverda - the liquid database,” in Datenbanksysteme für Business, Technologie und Web (BTW 2017), Union’s Horizon 2020 Research and Innovation Programme. Any 17. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" opinions, results, conclusions, and recommendations expressed (DBIS), 6.-10. März 2017, Stuttgart, Germany, Proceedings, pp. 619–622, 2017. in this material are those of the authors and do not necessarily reflect the views of BOUNCE or the European Commission. 6 REFERENCES [1] H. Kondylakis and D. Plexousakis, “Ontology evolution in data integration: Query rewriting to the rescue,” in ER, vol. 6998 of Lecture Notes in Computer Science, pp. 393–401, Springer, 2011.