=Paper= {{Paper |id=Vol-1735/paper2 |storemode=property |title=Chronos - Scalable Model Versioning, Querying & Persistence |pdfUrl=https://ceur-ws.org/Vol-1735/paper2.pdf |volume=Vol-1735 |dblpUrl=https://dblp.org/rec/conf/models/Haeusler16 }} ==Chronos - Scalable Model Versioning, Querying & Persistence == https://ceur-ws.org/Vol-1735/paper2.pdf
                                  Chronos
    Scalable Model Versioning, Querying & Persistence

                                 Martin Haeusler

            University of Innsbruck, Department of Computer Science
                          martin.haeusler@uibk.ac.at



     Abstract. The discipline of model engineering has matured consider-
     ably over the last years and is applied in a wide array of domains, ranging
     from embedded software development to IT landscape documentation. A
     variety of different model repositories attempt to meet the requirements
     that emerge when working on models, most notably efficient persistence,
     versioning and query capabilities. However, existing tools scale poorly
     when faced with large models used in practice, in particular in scenarios
     where (semi-)automated element generation pushes model sizes to hundreds
     of thousands of individual elements and beyond. In this paper, we present
     our research agenda for Chronos¹, an effort which aims to provide a
     solution to this problem in the form of a novel model repository for EMF
     Ecore models.


     1    Problem Description
     In recent years, the importance and popularity of model engineering have
     increased considerably in both academic and industrial settings [3].
     Models are being created in a variety of languages and frameworks, with
     EMF Ecore [12] and UML [8] as the most prominent examples. As models
     grow larger and more sophisticated, the demand for
     tools that support collaboration on model editing also increases. Such
     tools are often referred to as model repositories. Their core features en-
     compass storing, versioning and querying model data. Pierantonio et al.
     have recently assembled and published a list of existing model
     repositories [2]. The list contains open-source projects like Eclipse
     Common Data Objects (CDO)² and EMFStore³ as well as commercial systems
     such as MagicDraw Teamwork Server⁴. In our experience, none of these
     tools scales well with models of large sizes, most notably due to per-
     formance issues and/or lack of features (e.g. versioning capabilities and
     expressiveness of queries). An important factor for scalability is the em-
     ployed persistence technology, and all aforementioned model repositories
     use one of two technologies:
         – File / XML-Based
           Model repositories in this category store serialized forms of models
           in files (typically XML, or XMI [9] in the case of Ecore). For
           versioning, this implies the existence of one file per version.
           EMFStore and MagicDraw Teamwork Server are representatives of this
           category.
         – Relational / SQL-Based
           Tools based on relational technology aim to perform a mapping from
           the object representation into an equivalent relational
           representation. This process is referred to as object-relational
           mapping, or O/R-mapping. The relational information is then stored
           in a traditional database system. Versioning introduces additional
           entries in the relational tables to store the state of an element at
           a given version. Eclipse CDO makes use of this approach.

¹ This work was partially funded by the research projects “QE LaB - Living Models
  for Open Systems” (FFG 822740) and “txtureSA” (FWF-Project P 29022).
² https://eclipse.org/cdo/
³ http://www.eclipse.org/emfstore/
⁴ http://www.nomagic.com/products/teamwork-server.html

        Both techniques suffer from severe drawbacks and are not optimal for
        storage of model data. XML-based systems struggle with per-element
        versioning and the absence of indexing structures for querying, as well
        as having difficulties in providing lazy loading capabilities, because any
        given XML file must usually be processed in its entirety before indi-
        vidual elements can be extracted. Such processing (i.e. serialization or
        deserialization) is also very costly with respect to CPU power, RAM
        and runtime. Relational backends (e.g. SQL databases) provide indexing
        structures and per-element loading while also working on a user-defined
        schema which allows for versioning, provided that the repository takes
        care of this aspect on its own. The major problem with relational back-
        ends is the expensive Object–Relational Mapping (O/R mapping) pro-
        cess that converts objects into table entries and vice versa. O/R mapping
        increases considerably in complexity when objects with many connec-
        tions to other objects have to be processed, which is a very common
        use case for modeling. Typical O/R mapping algorithms are often
        implemented in a recursive fashion⁵, which can lead to call stack
        overflows on sufficiently large models. Converting the relational
        representation back into model element objects requires at least one
        SQL JOIN operation per connection. In the worst case, these operations
        have quadratic complexity and therefore scale poorly.
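
        The call stack problem of recursive mapping can be illustrated with a
        small sketch. It is not taken from any of the tools discussed here: the
        Element class, the row format and all function names are hypothetical,
        and Python's recursion limit merely stands in for the call stack limit
        of a JVM-based mapper. The sketch contrasts a recursive traversal with
        an equivalent one that uses an explicit work list.

```python
import sys


class Element:
    """Hypothetical model element holding references to other elements."""

    def __init__(self, name):
        self.name = name
        self.refs = []


def to_row(element):
    """Flatten one element into a (name, referenced names) table row."""
    return (element.name, [ref.name for ref in element.refs])


def map_recursive(element, rows, seen):
    """Recursive mapping: every followed reference adds a call stack frame."""
    if id(element) in seen:
        return
    seen.add(id(element))
    rows.append(to_row(element))
    for ref in element.refs:
        map_recursive(ref, rows, seen)


def map_iterative(root):
    """Same traversal with an explicit work list instead of the call stack."""
    rows, seen, todo = [], set(), [root]
    while todo:
        element = todo.pop()
        if id(element) in seen:
            continue
        seen.add(id(element))
        rows.append(to_row(element))
        todo.extend(element.refs)
    return rows


def build_chain(length):
    """Build a model in which each element references the next one."""
    root = current = Element("e0")
    for i in range(1, length):
        nxt = Element(f"e{i}")
        current.refs.append(nxt)
        current = nxt
    return root


# A reference chain that is deeper than the interpreter's recursion limit.
chain_length = sys.getrecursionlimit() + 1000
chain = build_chain(chain_length)

try:
    map_recursive(chain, [], set())
    recursive_succeeded = True
except RecursionError:
    recursive_succeeded = False

rows = map_iterative(chain)
```

        In this sketch the recursive variant fails on the long reference chain,
        while the work-list variant maps every element because its depth is
        limited only by available heap memory.
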
        We aim to improve the situation by addressing the hot spots of resource
        consumption: object (de-)serialization, storage of versioned data and
        queries on the persisted information. We do so by applying concepts from
        the NoSQL area and combining them with suitable mapping strategies
        and a novel approach to storage of versioned data.

        The remainder of this document is structured as follows. Section 2 pro-
        vides an overview of the related work. In Section 3 we present an outline
        of our solution. We present the expected contributions of the thesis in
        Section 4 and outline a plan for evaluation and validation of our approach
        in Section 5. Finally, Section 6 provides details on the current state of
        the project and Section 7 concludes the paper with a summary.

⁵ As for example in Hibernate: http://hibernate.org/
2    Related Work

Having realized the shortcomings of the existing storage solutions for
large models as described in Section 1, Gómez et al. proposed and imple-
mented alternatives for EMF based on NoSQL technology, using graphs [1]
and later also using key-value stores [5]. Their results clearly demonstrate
that storing model data in a graph or key-value format is not only fea-
sible, but also performs a lot better than traditional relational storage
mechanisms.
In 2014, Felber et al. proposed a set of algorithms that enable versioning
on a key-value store [4]. However, the work of Felber et al. focuses
exclusively on versioning of non-connected data, while Gómez et al.
considered graph-based storage without versioning aspects. The idea of graph-based model
persistence in a versioned key-value store backend forms the foundation
of our own work.
As the resulting artifact of this PhD thesis is going to be a model reposi-
tory, all the tools listed by Pierantonio et al. [2] are considered as related
work. There are two major conceptual differences between our solution
and existing repositories. The first difference is the level of abstraction
in the technology. All relevant existing tools are built on top of either a
pre-existing database or a file format. We are going to develop and provide
the entire data management stack, from the file format to the database
and transaction management to model querying. The second difference
is that our approach takes the latest advances in NoSQL research into
account, which opens many possibilities, in particular with respect to
model queries and performance.



3    Proposed Solution

We propose a model repository that is capable of handling tightly coupled,
large-scale models with 100,000 individual elements and beyond.
The features will include a rich query API and full per-element versioning
support, as well as important collaboration features such as lightweight
branching and conflict detection.



     [Figure: stack diagram with the labels Application, EMF, Ecore and
     Chronos, the latter comprising ChronoSphere, ChronoGraph and ChronoDB]

             Fig. 1. The Chronos Data Management Stack
       This project is named Chronos, and consists of three main parts (collec-
       tively called Chronos Components). These components build on top of
       each other:
         – ChronoDB is the storage backend, and the lowest layer in the
           Chronos project. It is a key-value store with built-in versioning
           capabilities, which we refer to as a Temporal Key-Value Store.
           ChronoDB is intended primarily as a lightweight storage backend for
           embedding into an application that writes to the local hard drive. In the
           repository, it is responsible for storage, versioning and branching. We
           implement these features on the lowest level because the key-value
           format is conceptually simple compared to an object graph, which
           reduces the complexity of the versioning problem.
         – ChronoGraph is an implementation of the Apache TinkerPop⁶
           graph computing API, mapping graph structures onto the key-value
           schema provided by ChronoDB. Several other libraries, including
           Titan DB⁷, have demonstrated that it is feasible to implement
           a graph database based on a key-value store. ChronoGraph
           provides the low-level query API and a standardized storage format
           to the repository.
         – ChronoSphere is an EMF Ecore model repository. Built on top
           of ChronoGraph, it maps incoming model data onto a standard-
           ized graph structure. By leveraging the capabilities of the TinkerPop
           graph query language Gremlin, it will provide expressive model-level
           queries to programmers, as well as the versioning and branching fea-
           tures provided by ChronoDB.
       In combination, these three components form a new type of model repos-
       itory that implements the full data management stack. Aside from this
       top-level goal, due to the modular design, these three components can
       also be used individually in other projects: ChronoDB is a general-
       purpose key-value store with versioning support, and ChronoGraph im-
       plements the Apache TinkerPop API, the de-facto standard API for
       graph databases, enabling it to act as drop-in replacement for other im-
       plementations.
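
       The idea of mapping graph structures onto a key-value schema can be
       sketched as follows. This is an illustration only and does not reflect
       the actual ChronoGraph storage format or the TinkerPop API: the class
       KeyValueGraph, the "v:" and "e:" key prefixes and the record layout are
       assumptions made for this example, and a plain dictionary stands in for
       the key-value backend.

```python
class KeyValueGraph:
    """Hypothetical graph layer storing vertices and edges as key-value entries."""

    def __init__(self):
        self.store = {}  # a plain dict stands in for the key-value backend

    def add_vertex(self, vertex_id, **properties):
        self.store[f"v:{vertex_id}"] = {"properties": properties, "out": [], "in": []}

    def add_edge(self, edge_id, label, source_id, target_id):
        self.store[f"e:{edge_id}"] = {
            "label": label,
            "source": source_id,
            "target": target_id,
        }
        # Each vertex record keeps the ids of its incident edges (adjacency list).
        self.store[f"v:{source_id}"]["out"].append(edge_id)
        self.store[f"v:{target_id}"]["in"].append(edge_id)

    def get_property(self, vertex_id, name):
        return self.store[f"v:{vertex_id}"]["properties"][name]

    def out_neighbors(self, vertex_id, label):
        """Follow outgoing edges with the given label using key lookups only."""
        targets = []
        for edge_id in self.store[f"v:{vertex_id}"]["out"]:
            edge = self.store[f"e:{edge_id}"]
            if edge["label"] == label:
                targets.append(edge["target"])
        return targets


graph = KeyValueGraph()
graph.add_vertex("pkg", kind="EPackage", name="library")
graph.add_vertex("book", kind="EClass", name="Book")
graph.add_vertex("author", kind="EClass", name="Author")
graph.add_edge("e1", "eClassifiers", "pkg", "book")
graph.add_edge("e2", "eClassifiers", "pkg", "author")
graph.add_edge("e3", "eReference", "book", "author")

classifier_names = [
    graph.get_property(vertex_id, "name")
    for vertex_id in graph.out_neighbors("pkg", "eClassifiers")
]
```

       Because every vertex and edge is an individual entry, a traversal only
       touches the records it actually visits, which is the property that
       makes per-element loading possible.
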


       4     Expected Contributions
       The thesis will provide the following contributions to the theory of tem-
       poral data stores and model repositories:


       4.1    Contributions to Theory
       A formal model for a Temporal Key-Value Store: This formalization
       is based on an infinite two-dimensional matrix structure that defines
       and exclusively relies on history-preserving append-only operations. It
       provides the formal semantics which guide the implementation of the
       prototype.

⁶ http://tinkerpop.incubator.apache.org/
⁷ http://thinkaurelius.github.io/titan/
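
The append-only semantics of a Temporal Key-Value Store can be made more
concrete with a minimal sketch. It is not the ChronoDB implementation or its
formal model: the class name, the method names and the tombstone marker are
assumptions for illustration. The store can be read as a sparse matrix with
one column per key and one row per commit timestamp, in which a read returns
the latest entry of a column at or before the requested row.

```python
import bisect

_TOMBSTONE = object()  # marks a deletion without erasing earlier history


class TemporalKeyValueStore:
    """Hypothetical append-only key-value store with per-key history."""

    def __init__(self):
        self._timestamps = {}  # key -> ascending list of commit timestamps
        self._values = {}      # key -> values, parallel to the timestamps
        self._now = 0

    def commit(self, changes, removals=()):
        """Append all changes under one new timestamp; nothing is overwritten."""
        self._now += 1
        for key, value in changes.items():
            self._append(key, value)
        for key in removals:
            self._append(key, _TOMBSTONE)
        return self._now

    def _append(self, key, value):
        self._timestamps.setdefault(key, []).append(self._now)
        self._values.setdefault(key, []).append(value)

    def get(self, key, timestamp=None):
        """Return the value of key as it was at the given timestamp."""
        if timestamp is None:
            timestamp = self._now
        stamps = self._timestamps.get(key, [])
        index = bisect.bisect_right(stamps, timestamp) - 1
        if index < 0:
            return None  # the key did not exist yet
        value = self._values[key][index]
        return None if value is _TOMBSTONE else value


store = TemporalKeyValueStore()
t1 = store.commit({"a": 1})
t2 = store.commit({"a": 2, "b": 5})
t3 = store.commit({}, removals=["a"])
```

In this sketch, store.get("a", t1) and store.get("a", t2) still return the
old values after the deletion at t3, because a removal is just another
appended entry.
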

A novel query framework: By utilizing the features offered by the
NoSQL graph database, ChronoSphere will provide an entirely new ap-
proach to model queries. It will allow developers to write queries at model
level in an internal, Java-embedded domain-specific language that can
be checked by the compiler. This enables compile-time detection of, e.g.,
type system errors or spelling mistakes, whereas string-based
alternatives such as OCL [10] can only be checked at
run-time. This language makes use of index structures that are updated
automatically when the model is changed. These indices are aware of the
versioning aspects, allowing equally fast queries on any version of the
model. Lazy evaluation will be the default mode of operation, enabling
the execution of queries without prior need for full resolution of a model
version.
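
The lazy mode of operation can be sketched with a small query pipeline. The
sketch is not the planned ChronoSphere query language, which is Java-embedded
and compiler-checked; Python offers no such compile-time checks, so only the
lazy evaluation aspect is shown. The class LazyQuery, its methods and the
element format are assumptions for illustration.

```python
class LazyQuery:
    """Hypothetical fluent query that pulls elements from its source on demand."""

    def __init__(self, source):
        self._source = source  # any iterable; nothing is consumed here

    def where(self, predicate):
        return LazyQuery(x for x in self._source if predicate(x))

    def select(self, function):
        return LazyQuery(function(x) for x in self._source)

    def first(self):
        """Pull elements until the first result is found, then stop."""
        return next(iter(self._source), None)


loaded = []  # records which elements were actually loaded


def load_elements(count):
    """Stand-in for loading model elements from the storage backend."""
    for i in range(count):
        loaded.append(i)
        yield {
            "id": i,
            "kind": "Class" if i % 3 == 0 else "Attribute",
            "name": f"element{i}",
        }


query = (
    LazyQuery(load_elements(100_000))
    .where(lambda e: e["kind"] == "Class" and e["id"] > 10)
    .select(lambda e: e["name"])
)
loaded_before_evaluation = len(loaded)
first_match = query.first()
```

Building the query loads nothing, and evaluating it loads only the elements
up to the first match rather than all 100,000 elements of the version.
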


4.2   Contributions to Practice
Alongside and based on the contributions to theory, the thesis is expected
to make a number of contributions to practice. We categorize them by
software artefact.

Graph Database and Key-Value Store: To the best of our knowledge,
ChronoGraph will be the first implementation of the Apache TinkerPop
API that offers full versioning support, which is provided by the
temporal key-value store in ChronoDB. Lightweight branching, as seen
in popular version control systems such as Git or SVN, is also part of
the versioning engine. ChronoGraph will be the first graph database
with full ACID transaction support with the highest possible isolation
level (“serializable” [7]). By taking advantage of versioning, ChronoDB
and ChronoGraph support long-running transactions without sacrificing
throughput of concurrent short-lived transactions, which is of particular
importance in modeling scenarios that involve model analysis. Both
ChronoDB and ChronoGraph offer timestamp-agnostic queries that can
be executed without modification on any model version by injecting
the desired timestamp from the transaction metadata. This can
be used in a variety of ways, e.g. for after-the-fact collection of time
series data in the history, or for comparing the result of a model query
at two different points in time. Temporal indices allow for equal query
performance on any model version, while temporal conflict detection
protects users from history corruption, data loss and other anomalies.
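
A timestamp-agnostic query can be sketched as a function that never mentions
time and receives the timestamp through its transaction. The sketch below is
not the ChronoDB or ChronoGraph API: VersionedStore, Transaction and
count_servers are hypothetical names, and the linear scan over an append-only
log replaces the temporal indices for brevity.

```python
class VersionedStore:
    """Hypothetical store that keeps an append-only log of all changes."""

    def __init__(self):
        self._log = []  # entries of the form (timestamp, key, value)
        self._now = 0

    def commit(self, changes):
        self._now += 1
        self._log.extend((self._now, key, value) for key, value in changes.items())
        return self._now

    def tx(self, timestamp=None):
        """Open a transaction on the given timestamp (default: latest)."""
        return Transaction(self, self._now if timestamp is None else timestamp)


class Transaction:
    def __init__(self, store, timestamp):
        self._store = store
        self.timestamp = timestamp  # fixed for the lifetime of the transaction

    def snapshot(self):
        """Replay the log up to the transaction timestamp."""
        state = {}
        for timestamp, key, value in self._store._log:
            if timestamp <= self.timestamp:
                state[key] = value
        return state


def count_servers(tx):
    """Query written without any reference to time or versions."""
    return sum(1 for value in tx.snapshot().values() if value["type"] == "Server")


store = VersionedStore()
t1 = store.commit({"n1": {"type": "Server"}, "n2": {"type": "Database"}})
long_running = store.tx()  # opened before the next commit
t2 = store.commit({"n2": {"type": "Server"}, "n3": {"type": "Server"}})

servers_at_t1 = count_servers(store.tx(t1))
servers_at_t2 = count_servers(store.tx(t2))
servers_seen_by_long_running = count_servers(long_running)
```

The same function yields the result for either point in time, and the
transaction opened before the second commit keeps seeing its original version
without blocking the later writer.
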

Model Repository: The ACID nature of ChronoGraph transactions
allows ChronoSphere clients to work on consistent views of the EMF
Ecore model data, which is important e.g. for analysis and refactoring
tasks. The repository is designed to handle hundreds of thousands of
individual elements and beyond. This is achieved by lazy loading of
EObjects and their automatic unloading in case they are no longer needed
by the application. In order to support the insertion of large models into
the repository, ChronoSphere provides batch-based incremental commits
that are internally merged into a single model version when the last batch
is written. ChronoSphere aims for the best possible EMF ecosystem
integration and to provide EObjects that are compatible with popular
Ecore-based frameworks, such as EMF Compare.
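
       The batch-based incremental commit can be sketched as follows. This is
       a simplified illustration under assumptions, not the ChronoSphere API:
       the class BatchRepository, its method names and the representation of a
       version as a full dictionary are hypothetical.

```python
class BatchRepository:
    """Hypothetical repository in which every full commit creates one version."""

    def __init__(self):
        self.versions = [{}]  # versions[i] holds the model content at version i
        self._staged = None   # elements written by a running batch insertion

    def head(self):
        return self.versions[-1]

    def begin_batch(self):
        self._staged = {}

    def commit_incremental(self, batch):
        """Write one batch without publishing a new model version."""
        self._staged.update(batch)

    def commit(self):
        """Merge all staged batches into a single new model version."""
        merged = dict(self.head())
        merged.update(self._staged)
        self.versions.append(merged)
        self._staged = None
        return len(self.versions) - 1


def in_batches(items, size):
    """Split a list of (id, element) pairs into dictionaries of limited size."""
    for start in range(0, len(items), size):
        yield dict(items[start:start + size])


elements = [(f"id{i}", {"name": f"Element{i}"}) for i in range(2500)]

repo = BatchRepository()
repo.begin_batch()
visible_during_import = []
for batch in in_batches(elements, 1000):
    repo.commit_incremental(batch)
    visible_during_import.append(len(repo.head()))
version = repo.commit()
```

       Readers of the head version see no partially imported model while the
       batches are written, and the history gains exactly one version for the
       whole import.
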



       5    Plan for Evaluation and Validation

       The research methodology applied for this thesis follows the principles
       of Design Science as defined by Peffers [11] and Hevner [6]. The system
       will be evaluated using a variety of techniques:
         – Prototype Implementation: The implementation of the prototype
            serves as the proof-of-concept for the theoretical foundations
            and is the main design artefact. An extensive automated test suite
            developed alongside the prototype will assert that its functionality
            behaves as intended and in particular properly implements the the-
            oretical foundations defined in the thesis.
         – Performance Measurements: Measuring the performance (e.g.
            CPU and RAM usage) of the prototype implementation and com-
            paring it to other model repositories will provide the necessary data
            for a discussion on the factual scalability of the model repository.
         – Industrial Case Study: Several Chronos components are already
            being used in an industrial setting as the primary storage backend for
            the IT Landscape Documentation tool Txture⁸ [13]. The deployment
            of Txture with Chronos components at industrial research partners
            allows us to conduct case studies in real-world scenarios beyond
            laboratory conditions.



       6    Current Status

       As of July 2016, large parts of Chronos have already been implemented.
       The core of ChronoDB is complete⁹, and a paper titled Scalable
       Versioning for Key-Value Stores has been accepted for publication
       at the 5th International Conference on Data Management Technologies
       and Applications (DATA 2016). This paper covers the theoretical foun-
       dations and implementation aspects of ChronoDB. The evaluation of the
       core components of ChronoGraph as a backend for Txture is currently
       ongoing and shows very promising early results with respect to perfor-
       mance. A publication on this topic is the next step. The implementation
       of ChronoSphere is currently work-in-progress. We plan a publication on
       this topic at the MODELS Conference 2017. For the final publication,
       we aim for a journal paper in 2018 that summarizes our findings and
       combines them with the results of an industrial case study. We aim for
       the conclusion and publication of the PhD thesis by the end of 2018.
⁸ www.txture.tools
⁹ https://github.com/MartinHaeusler/chronos/tree/master/chronodb
7    Summary and Conclusion
In this paper, we have provided an overview of the Chronos project
that aims to provide a scalable solution for storing, versioning and query-
ing model data, based on NoSQL graph and key-value techniques. In
contrast to existing solutions, it will not rely upon XML serialization or
object-relational mappings. Instead, the entire data management stack
will be implemented from scratch, with the use case of storing large
models in mind. The resulting artefact will serve as a proof-of-concept and
will be tested against an automated test suite and in an industrial case
study. Comparative benchmarks with existing repositories are also part
of the evaluation. The planned end of the PhD thesis is in 2018, with a
total of three conference publications and one journal paper.


References
 1. Benelallam, A., Gómez, A., et al.: Neo4EMF, a scalable persistence
    layer for EMF models. In: Modelling Foundations and Applications,
    pp. 230–241. Springer (2014)
 2. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Collaborative
    Repositories in Model-Driven Engineering. IEEE Software 32(3), 28–
    34 (May 2015)
 3. Di Ruscio, D., De Lara, J., Pierantonio, A.: Proceedings of the 3rd
    Workshop on Extreme Modeling at MoDELS 2014. CEUR, Valencia,
    Spain (2014), http://ceur-ws.org/Vol-1239/
 4. Felber, P., Pasin, M., et al.: On the Support of Versioning in Dis-
    tributed Key-Value Stores. In: 33rd IEEE International Symposium
    on Reliable Distributed Systems, SRDS 2014, Nara, Japan, October
    6-9, 2014. pp. 95–104 (2014)
 5. Gómez, A., Tisi, M., Sunyé, G., Cabot, J.: Map-Based Transpar-
    ent Persistence for Very Large Models. Fundamental Approaches to
    Software Engineering 9033, 19–34 (2015)
 6. Hevner, A.R., March, S.T., Park, J., Ram, S.: Design Science in
    Information Systems Research. MIS Quarterly 28(1), 75–105 (2004)
 7. ISO: SQL Standard 2011 (ISO/IEC 9075:2011) (2011)
 8. Object Management Group: UML 2.3 Superstructure (2010),
    http://www.omg.org/spec/UML/2.3
 9. OMG: XML Metadata Interchange (XMI). OMG (2007),
    http://www.omg.org/technology/documents/modeling_spec_catalog.htm#XMI
10. Object Management Group (OMG): Object Constraint Language (OCL),
    version 2.3.1 (2012), http://www.omg.org/spec/OCL/2.3.1/
11. Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S.: A de-
    sign science research methodology for information systems research.
    Journal of management information systems 24(3), 45–77 (2007)
12. Steinberg, D., Budinsky, F., Merks, E., Paternostro, M.: EMF:
    eclipse modeling framework. Pearson Education (2008)
13. Trojer, T., Farwick, M., Häusler, M., Breu, R.: Living Models of
    IT Architectures: Challenges and Solutions. Software, Services and
    Systems 8950, 458–474 (2015)