Towards Flexible Parsing of Structured Textual Model Representations Dimitrios S. Kolovos1 , Nicholas Matragkas2 , and Antonio Garcia-Dominguez3 1 Department of Computer Science, University of York, UK dimitris.kolovos@york.ac.uk 2 Department of Computer Science, University of Hull, UK n.matragkas@hull.ac.uk 3 Department of Computer Science, Aston University, UK a.garcia-dominguez@aston.ac.uk Abstract. Existing parsers for textual model representation formats such as XMI and HUTN are unforgiving and fail upon even the smallest inconsistency between the structure and naming of metamodel elements and the contents of serialised models. In this paper, we demonstrate how a flexible parsing approach can transparently and automatically resolve a number of these inconsistencies, and how it can eventually turn XML into a human-readable and editable textual model representation format for particular classes of models. 1 Introduction Depending on the nature of a modelling language and the preferences of the engineers that work with it, models conforming to the language can be edited using graphical (e.g. diagram-based, tree-based, table-based), textual, or hybrid notations. Previous work has made the case for the value of text-based model representations as a means of leveraging existing powerful text editing tools and file-based version control systems [1]. In scenarios where text-based model editing is preferable for a new Domain-Specific Language (ie. metamodel), engineers are presented with two options. – Define a bespoke textual syntax for the modelling language using frameworks such as Xtext, Spoofax, or EMFText; – Use a generic syntax such as OMG’s Human-Usable Textual Notation (HUTN) [2] or XML Metadata Interchange (XMI) [3]. The first approach can produce a tailored and concise syntax but requires substantial expertise and upfront development effort, and incurs a maintenance cost over time as the underlying framework/IDE evolves. The second approach on the other hand, does not require an investment in bespoke tool development, but comes at the cost of a more verbose and repetitive textual syntax. To a large extent, this is due to the rigidness of existing parsers for such generic textual syntaxes. Existing parsers for XMI and HUTN require exact lexical equivalence between names defined in the metamodel (e.g. type and attribute names) and tokens found in serialised models and fail even upon the smallest inconsistency. In this paper we propose a novel approach for parsing structured XML doc- uments into in-memory models that conform to Ecore metamodels1 . Instead of requiring exact lexical equivalence, the proposed approach uses approximate matching to map XML element and attribute names to metamodel types and their structural features. We use XML as an example structured format; the proposed approach is trivially portable to other textual formats such as HUTN, JSON and YAML. The rest of the paper is organised as follows. Section 2 presents our flexible XML parsing algorithm with the help of a running example and discusses an open-source implementation of the algorithm (Flexmi) and supporting tooling, Section 3 presents the results of empirical evaluation experiments on the scala- bility and the practicality of the proposed algorithm, Section 4 discusses related work, and Section 5 concludes the paper and provides directions for further work on this topic. 2 Fuzzy Reflective Parsing In this section, we present a novel flexible algorithm (parser) for parsing an XML document into an in-memory model that conforms to an Ecore metamodel. It is worth noting that the parser requires input XML documents to be well-formed2 . The parser performs a depth-first traversal of the elements of the XML doc- ument, in which each element is visited exactly once. Before encountering the root element of the document, the parser expects to find an nsuri processing in- struction (see line 2 in Listing 1.2) that declares the unique identifier (namespace URI) of the Ecore metamodel that the model is an instance of. In its current form, the parser supports only models that conform to a single Ecore meta- model (EPackage), but it is trivially extensible to support models conforming to multi-EPackage metamodels. The parser uses a (initially empty) stack, to keep track of its position during the depth-first traversal of the XML document. When the root element of the XML document is encountered, the parser computes the string similarity3 of the tag of the element with all instantiable (i.e. non-abstract) EClasses in the meta- model and selects the EClass with the maximum similarity. An instance of that EClass is created, is set as a top-level element in the in-memory representation of the model, and is pushed to the stack. The parser then proceeds in a recursive depth-first manner to visit the de- scendants of the root element. When it encounters a new XML element, it follows the algorithm displayed in Listing 1.1. When it encounters the closing tag of an 1 Ecore is the metamodelling language of the Eclipse Modelling Framework. 2 https://www.w3.org/TR/REC-xml/#sec-well-formed 3 In this work we use Levenshtein’s string editing distance [4] to compute string sim- ilarities, however, other algorithms can also be used for this purpose. XML element, it pops the top item of the stack (as discussed in Listing 1.1, this may be a model element, a containment slot, or null ). A step-by-step discussion of the execution of the algorithm on a sample XML document is provided below. The algorithm presented in Listing 1.1 explains how XML elements are mapped to model elements but does not discuss how the attributes of XML elements are mapped to attributes and non-containment references of the re- spective model elements; this is delegated to the set_attribute_values function which is invoked in lines 23 and 36, and is presented in Listing 1.3. A running example follows in Section 2.1. Listing 1.1. Algorithm executed when the start tag of a new XML element is encoun- tered by the parser 1 procedure fuzzy_parse(xml_element, stack) 2 let parent = stack.peek() 3 if parent is a model element then 4 let parent_model_element = parent 5 6 if xml_element has no attributes and only text then 7 /* xml_element is interpreted as an attribute value */ 8 let attr = the attribute of the parent_model_element’s EClass with the highest string similarity to the name of the xml_element 9 let parent_model_element.attr = textual content of the xml_element, cast appropriately 10 push null to the stack 11 end 12 else if xml_element has no attributes then 13 /* xml_element is interpreted as a containment slot */ 14 let reference = the containment reference of the parent_model_element’s EClass with the highest string similarity to the tag of the xml_element 15 push a containment slot that encapsulates the parent_model_element and the reference to the stack 16 end 17 else /*The element has attributes*/ 18 /* xml_element is interpreted as a model element */ 19 let candidate_types = all types of parent_model_element’s containment references, and their sub-types 20 let new_model_element_type = the EClass from candidate_types with the highest string similarity to the name of the xml_element 21 let reference = the first reference of the EClass of parent_model_element that has a compatible type with new_model_element_type 22 let new_model_element = new instance of new_model_element_type 23 call set_attribute_values (new_model_element, xml_element) 24 if the reference is single-valued then 25 set the value of the reference to new_model_element 26 else 27 add new_model_element to the list of values of the reference 28 end if 29 push new_model_element to the stack 30 end 31 else if parent is a containment slot then 32 /* xml_element is interpreted as a model element */ 33 let containment_slot = parent 34 let new_model_element_type = the EClass with the highest similarity to the tag of the xml_element, chosen from the EClasses that are subtypes of the type of the reference of containment_slot (including the reference type itself ) 35 let new_model_element = new instance of new_model_element_type 36 call set_attribute_values (new_model_element, xml_element) 37 if the containment_slot’s reference is single-valued then 38 set the value of the reference of containment_slot’s model element to new_model_element 39 else 40 add new_model_element to the list of values of the reference 41 end 42 end 43 end procedure Fig. 1. Messaging System Metamodel 2.1 Running Example This section presents the application of the algorithm of Listing 1.1 on the XML document illustrated in Listing 1.2. The Ecore metamodel that corresponds to the http://messsaging namespace URI in line 2 of the XML document is pre- sented in Figure 1. During its application, the algorithm performs the following steps: Listing 1.2. Example XML document 1 2 3 4 5 6 7 8 9 Fuzzy parsing is 10 so cool. 11 12 13 14 15 16 17 18 19 20 21 22 23 1. In line 2 of the XML document, the parser encounters the nsuri processing instruction, through which the XML document declares that it should be parsed as an instance of the http://messsaging Ecore metamodel (illustrated in Figure 1); 2. In line 3, the parser encounters the sys element. As discussed above, since the parser’s stack is empty, it knows that it is at the root of the document and as such it calculates the similarity of the tag name (sys) against the names of all instantiable classes in the metamodel and decides that System is the closest match. As such, it creates an instance of System, places it as a top-level element of the model and pushes it to the stack. 3. In line 4, the parser encounters the u XML element. It peeks at the stack and sees that the top item is a model element (the instance of System that was created in the step above). Since the new XML element has attributes and complex children, as specified in line 39 of Listing 1.1, the parser at- tempts to convert it into a new model element. Since u is contained under a System element, according to the metamodel, its only two possible types are Parameter and User. Using string similarity, the parser decides that u is closer to User, creates an instance of the latter, adds it to the users reference of the System instance, and pushes it to the stack; 4. In line 5, the parser encounters the box XML element and behaves similarly to step 3: it creates an instance of Mailbox, adds it to the values of the mailbox reference of the User created in step 3, and pushes it to the stack. Here it’s worth noting that since according to the metamodel only instances of Mailbox can be contained under instances of User, the actual tag of the element (box ) is irrelevant. As such, if the Mailbox was to be renamed in an evolved version of the metamodel (e.g. to Channel ), the parser would still construct a valid model; 5. In line 6, the parser encounters the in XML element. It peeks at the stack and sees that the top element is a model element (the instance of Mailbox created in the previous step). Since the in element has complex children but no attributes, according to line 12 of Listing 1.1, it treats it as a containment slot. As such it compares its tag name against the names of containment references in the Mailbox type (incoming and outgoing) and selects incoming as the closest match. It then encapsulates the incoming reference and the Maibox element into a containment slot object and pushes the latter to the stack; 6. In line 7, the parser encounters the msg element. It peeks at the stack and sees that the top element is the containment slot pushed in the previous step. As such, according to line 31 of Listing 1.1, it treats the msg element as a model element that needs to be added to the containment slot (i.e. to the incoming messages of the Mailbox created in step 4). Similarly, to step 4, the tag name of the XML element is irrelevant since there is only one type of model elements that can be contained under the incoming reference. As such, if the name of the Message class was to change in a revised version of the metamodel, the parser would still produce a valid model; 7. In line 8, the parser encounters the body XML element. It peeks at the stack and sees that the top element is the Message model element pushed in step 7. Since the body element does not have attributes and only has textual content, the parser treats its content as an attribute value. It identifies the actual attribute to be populated (body) by comparing the string similarity of the tag of the element against the names of all attributes in the Message class. Finally, it pushes null to the stack as no further elements exist under the body element; 8. In lines 11-15 the parser encounters the close tags of the elements started in lines 4-8 and for each close tags it pops the top item of the stack. After these lines are processed the top element of the stack is the instance of System pushed in step 2; 9. Lines 16-22 are processed in a very similar manner to lines 4-15, with the main difference being that the out element in line 18 is interpreted as a containment slot for the outgoing reference of class Mailbox – in contrast to step 5 where the in element was interpreted as a slot for the incoming reference. 2.2 Processing XML Attribute Values Once the parser has determined that an XML element needs to be transformed into a model element, beyond instantiating the new model element, it also at- tempts to use the attributes of the XML element to populate the values of EAttributes and non-containment EReferences of the new model element. The first step of this process involves computing an optimal assignment of XML at- tributes to EAttributes and EReferences of the new model element (i.e. deciding which attribute will be mapped to which feature). This is an instance of the as- signment problem – as every attribute in the XML element needs to be assigned to at most one EAttribute or EReference of the new model element – which the parser solves using the Hungarian method [5]. Once the optimal assignment between XML attributes and EAttributes/ERef- erences has been identified, the next step is to use the values of the prior to populate the values of the latter. This is achieved using the algorithm in List- ing 1.3. Briefly, for multi-valued features the value of the XML attribute is first split using comma (,) as a separator and is then processed as follows. If the mapped feature is an EAttribute, then each fragment is cast to the datatype of the EAttribute and is set/added to the latter’s value. For example when the algorithm processes line 17 of the Listing 1.2, it matches the q attribute to the quota feature of EClass Mailbox, it converts the value of q into an integer, and it then assign the quota attribute of the mailbox to it. Type casting errors are recorded so that they can be reported to clients of the algorithm. If the mapped feature is an EReference, the algorithm temporarily records an unresolved reference. All unresolved references collected at this stage are attempted to be resolved after the entire XML document has been processed by the algorithm. References are resolved by matching against the IDs of model elements of compatible types. The ID of a model element is determined as follows: – If the EClass of the model element has an EAttribute marked as ID 4 , the ID of the model element is the value of that EAttribute; – Else the ID of the model element is the name of its ID or name EAttribute (if one exists); – In all other cases the model element is assigned an internal ID and cannot be referenced by other elements. Listing 1.3. Algorithm that maps XML element attribute values to values of EAt- tributes and non-containment EReferences 1 procedure set_attribute_values(xml_element, model_element) 2 let assignment = the assignment with the maximum total similarity 3 foreach xml_attribute of the xml_element do 4 let feature = the corresponding feature of the xml_attribute in the assignment 5 if feature is not null then 6 let value = the value of the xml_attribute 7 if feature is an EAttribute then 8 if the feature is single-valued then 9 set the value of the feature to value (typecast) 10 else 11 let values = split value by comma 12 set the value of the feature to values (typecast) 13 end 14 end 15 else /* the feature is an EReference */ 16 if the feature is single-valued then 17 create an unresolved reference 18 else 19 let values = split value by comma 20 foreach value in values do 21 create an unresolved reference 22 end 23 end 24 end 25 end 26 end 27 end procedure 2.3 Implementation We have implemented the flexible reflective parsing algorithm discussed above on top of the Eclipse Modelling Framework, in the form of an implementa- tion of EMF’s Resource interface. The implementation (Flexmi) is available as open-source software under the permissive Eclipse Public License under https: //github.com/kolovos/flexmi. Flexmi also comprises a dedicated editor which provides syntax highlighting and error reporting facilities. 3 Evaluation Having presented the flexible reflective parsing algorithm in Section 2, in this section we report on the results of benchmarking the Flexmi prototype that im- plements it. Through this evaluation, we wish to assess 1) how the algorithm 4 This is a modifier provided by Ecore, which is orthogonal to the name of an EAt- tribute scales with increasing XML document sizes, and 2) whether the overhead im- posed by the algorithm is prohibitive for practical use. To conduct this evaluation, we implemented a model generator that produces different sizes of synthetic XML documents that are meant to be parsed as Ecore metamodels. We used the generator to produce models containing between 217 and 218,338 model elements (EPackages, EClasses, EAttributes, EReferences and EDataTypes), linked through both containment and non-containment refer- ences. We then transformed the generated models into an XMI representation so that we can compare the performance of the proposed flexible parsing algorithm with that of the EMF-based XMI parser in order to assess the magnitude of the overhead imposed by the nature of flexible parsing. The results obtained through our benchmarks are displayed in the line chart of Figure 2. The x-axis of the chart represents the number of model elements, while the y-axis represents the time (in miliseconds) that the parsers required in order to parse models containing these elements. We can observe that Flexmi 1) scales linearly with the size of the model, and 2) is around 5.2 times slower than the XMI parser that ships with EMF. 3,000 2,500 Loading time (in ms) 2,000 1,500 XMI 1,000 Flexmi 500 0 0 50,000 100,000 150,000 200,000 250,000 # of model elements Fig. 2. Benchmark Results: Flexmi vs. XMI 3.1 Threats to Validity For this empirical evaluation, we have used synthetic models that conform to an existing metamodel (Ecore). Conceivably, the performance of the algorithm on manually crafted models conforming to different metamodels could differ although, analytically, we see no reason for this. Evaluating the algorithm on manually crafted models is currently not feasible due to the unavailability of an existing data set, as this is a newly introduced approach to model parsing. 4 Related Work While we are not aware of prior work that addresses the problem of loading models in a flexible manner in the context of MDE, repairing and recovering from parsing errors is also studied in the XML community, where parsing is a core operation performed before an XML document can be navigated, queried, and manipulated. Generally speaking we can classify errors in XML documents in two main categories, namely well-formedness violations and structural invalidities. A well-formedness violation occurs when an XML document does not conform to the syntax production rules and the constraints defined in the XML specification [6], while structural invalidity occurs when the tree corresponding to an XML document is not accepted by the automaton associated with its schema. Verifying well-formedness of XML is straightforward to do by using a stack in linear time. Automatically repairing a malformed XML document is more challenging though. In [7] the authors propose a dynamic programming algorithm which computes the edit distance between a malformed and its corresponding well-formed document. Such approaches can be complementary to our approach, since we assume that the input XML documents are well-formed. On the other hand verifying and repairing structural invalidities is more de- manding. First the similarity between the XML document and its corresponding schema must be calculated. Then the minimum number of tree edit operations, which make the document under consideration valid with respect to the schema, is calculated. Finally, generation of edit sequence operations is performed. In [8, 9] an algorithm is presented for measuring the similarity between an XML document d and its corresponding schema D. This is done by measuring the common and the extra fragments between the two. In [10] the edit distance is defined as the minimum cost of all edit sequences transforming d into a valid document. A detailed survey of the different approaches for measuring similarity between an XML document and its schema is provided in [11]. The problem for measuring the minimum number of tree edits for making a tree valid with respect to a tree grammar is described for the first time in [12]. However, the algorithm proposed in this paper applies only to ranked trees. which is not the case of XML trees. In [13] the authors propose an algorithm for structural repairs of XML document with respect to single type tree grammars. Finally, [13, 14] propose an algorithm for automatically computing a set of edit operation sequences between an XML document and a tree grammar. 5 Conclusions In this paper, we have presented a novel flexible reflective algorithm that can parse XML documents into in-memory models conforming to an Ecore meta- model. We have also discussed an open-source implementation of the algorithm and used it to evaluate its efficiency against the XMI parser provided by EMF. The results of our empirical evaluation show that the proposed algorithm scales linearly with the size of the XML document and that its overhead of flexible parsing is not prohibitive for practical use. The main limitation of the proposed approach, in its current form, is that models loaded using Flexmi cannot be programmatically modified and persisted again as the parser does not memorise the mapping between XML element and attribute names and the model elements/structural feature values it has pro- duced from them. We plan to add such memorisation capabilities in future iter- ations of this work. We also plan to systematically investigate additional string similarity algorithms, beyond [4], and assess their suitability for flexible parsing. Acknowledgements. This work was partially supported by the European Commission, through the Scalable Modelling and Model Management on the Cloud (MONDO) FP7 STREP project (grant #611125). References 1. Ole Lehrmann Madsen and Birger Møller-Pedersen. A Unified Approach to Mod- eling and Programming. In Proceedings of the 13th International Conference on Model Driven Engineering Languages and Systems: Part I, MODELS’10, pages 1–15, Berlin, Heidelberg, 2010. Springer-Verlag. 2. Object Management Group. Human-Usable Textual Notation Specification, 2004. http://www.omg.org/spec/HUTN/1.0/. 3. Object Management Group. XML Metadata Interchange (version 2.5.1), 2015. http://www.omg.org/spec/XMI/2.5.1/. 4. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966. 5. H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. 6. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau. Extensible Markup Language (XML) 1.0 (Fifth Edition). Technical Report https://www.w3.org/TR/REC-xml/, W3C, February 2013. 7. Flip Korn, Barna Saha, Divesh Srivastava, and Shanshan Ying. On repairing structural problems in semi-structured data. Proceedings of the VLDB Endowment, 6(9):601–612, 2013. 8. Elisa Bertino, Giovanna Guerrini, and Marco Mesiti. A matching algorithm for measuring the structural similarity between an xml document and a dtd and its applications. Information Systems, 29(1):23–46, 2004. 9. Elisa Bertino, Giovanna Guerrini, and Marco Mesiti. Measuring the structural similarity among XML documents and DTDs. Journal of Intelligent Information Systems, 30(1):55–92, 2008. 10. Slawomir Staworko and Jan Chomicki. Validity-sensitive querying of XML databases. In Current Trends in Database Technology–EDBT 2006, pages 164– 177. Springer, 2006. 11. Joe Tekli, Richard Chbeir, Agma Traina, and Caetano Traina. XML document- grammar comparison: related problems and applications. Open Computer Science, 1(1):117–136, 2011. 12. Utsav Boobna and Michel de Rougemont. Correctors for XML data. In Database and XML Technologies, pages 97–111. Springer, 2004. 13. Martin Svoboda and Irena Mlỳnková. Correction of Invalid XML Documents with Respect to Single Type Tree Grammars. In Networked Digital Technologies, pages 179–194. Springer, 2011. 14. Nobutaka Suzuki. Finding K Optimum Edit Scripts between an XML Document and a RegularTree Grammar. In Proc. Emerging Research Opportunities for Web Data Management, 2007. http://ceur-ws.org/Vol-229/.