A Conceptual Model for the XML Schema Evolution
Overview: Storing, Base-Model-Mapping and Visualization

Thomas Nösinger, Meike Klettke, Andreas Heuer
Database Research Group
University of Rostock, Germany
(tn, meike, ah)@informatik.uni-rostock.de

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Copyright is held by the author/owner(s).

ABSTRACT
In this article the conceptual model EMX (Entity Model for XML-Schema) for dealing with the evolution of XML Schema (XSD) is introduced. The model is a simplified representation of an XSD, which hides the complexity of XSD and offers a graphical presentation. For this purpose a unique mapping is necessary, which is presented together with further information about the visualization and the logical structure. A small example illustrates the relationships between an XSD and an EMX. Finally, the integration into a research prototype developed for dealing with the co-evolution of corresponding XML documents is presented.

1. INTRODUCTION
The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To ensure that well-defined XML documents can be understood by every participant (e.g. user or application), it is necessary to introduce a document description, which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. An XML document is called valid if it fulfills all restrictions and conditions of an associated XML Schema.

XML Schemas that have been used for years have to be modified from time to time. The main reason is that the requirements for exchanged information can change. To meet these requirements the schema has to be adapted, for example if additional elements are added to an existing content model, the data type of information changes or integrity constraints are introduced. All in all, every possible structure of an XML Schema definition (XSD) can be changed. A question arises: how can somebody make these adaptions without being forced to understand and deal with the whole complexity of an XSD? One solution is the definition of a conceptual model for simplifying the base-model; in this paper we outline further details of our conceptual model called EMX (Entity Model for XML-Schema).

A further issue, not covered in this paper but important in the overall context of exchanging information, is the validity of XML documents [5]. Modifications of an XML Schema require adaptions of all XML documents that are valid against the former XML Schema (also known as co-evolution).

One impractical way to handle this problem is to introduce different versions of an XML Schema, but in this case all versions have to be stored and every participant of the heterogeneous environment has to understand all the different document descriptions. An alternative solution is the evolution of the XML Schema, so that just one document description exists at any one time. This does not solve the above-mentioned validity problem of XML documents, but with a standardized description of the adaptions (e.g. a sequence of operations [8]) and by knowing a conceptual model including the corresponding mapping to the base-model (e.g. XSD), it is possible to derive the necessary XML document transformation steps automatically from the adaptions [7]. The conceptual model is an essential prerequisite for the process of XML Schema evolution, which is touched on here but not handled in detail.

This paper is organized as follows. Section 2 gives the necessary background of XML Schema and corresponding concepts. Section 3 presents our conceptual model by first giving a formal definition (3.1), followed by the specification of the unique mapping between EMX and XSD (3.2) and the logical structure of the EMX (3.3). After introducing the conceptual model we present an example of an EMX in section 4. In section 5 we describe the practical use of EMX in our prototype, which was developed for handling the co-evolution of XML Schema and XML documents. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND
In this section we present a common notation used in the rest of the paper. First we briefly introduce the abstract data model (ADM) and the element information item (EII) of XML Schema, before further details concerning different modeling styles are given.

The XML Schema abstract data model consists of different components or node types^1; basically these are: type definition components (simple and complex types), declaration components (elements and attributes), model group components, constraint components, group definition components and annotation components [3]. Additionally the element information item exists, an XML representation of these components. The element information item defines which content and attributes can be used in an XML Schema. Table 1 gives an overview of the most important components and their concrete representation.

^1 An XML Schema can be visualized as a directed graph with different nodes (components); an edge realizes the hierarchy.

ADM                 Element Information Item
declarations        <element>, <attribute>
group-definitions   <attributeGroup>
model-groups        <all>, <choice>, <sequence>, <any>, <anyAttribute>
type-definitions    <simpleType>, <complexType>
N.N.                <include>, <import>, <redefine>, <override>
annotations         <annotation>
constraints         <key>, <unique>, <keyref>, <assert>, <assertion>
N.N.                <schema>

Table 1: XML Schema Information Items

The <include>, <import>, <redefine> and <override> are not explicitly given in the abstract data model (N.N. - Not Named), but they are important components for embedding externally defined XML Schema (esp. element declarations, attribute declarations and type definitions). In the rest of the paper, we will summarize them under the node type "module". The <schema> "is the document (root) element of any W3C XML Schema. It's both a container for all the declarations and definitions of the schema and a place holder for a number of default values expressed as attributes" [9].

Analyzing the possibilities of specifying declarations and definitions leads to four different modeling styles of XML Schema: Russian Doll, Salami Slice, Venetian Blind and Garden of Eden [6]. These modeling styles mainly influence the re-usability of element declarations or defined data types and also the flexibility of an XML Schema in general. Figure 1 summarizes the modeling styles with their scopes.

Scope                                      Russian Doll   Salami Slice   Venetian Blind   Garden of Eden
element and attribute declaration: local        x                              x
element and attribute declaration: global                      x                                x
type definition: local                          x              x
type definition: global                                                        x                x

Figure 1: XSD Modeling Styles according to [6]

The scope of element and attribute declarations as well as the scope of type definitions is global iff the corresponding node is specified as a child of the <schema> and can be referenced (by knowing e.g. the name and namespace). Locally specified nodes, in contrast, are not directly under <schema>; their re-usability is therefore not given or not possible.

An XML Schema in the Garden of Eden style contains only global declarations and definitions. If the requirements for exchanged information change and the underlying schema has to be adapted, then this modeling style is the most suitable. The advantage of the Garden of Eden style is that all components can be easily identified by knowing the QNAME (qualified name). Furthermore the position of components within an XML Schema is obvious. A qualified name is a colon-separated string of the target namespace of the XML Schema followed by the name of the declaration or definition. The name of a declaration or definition is a string of the data type NCNAME (non-colonized name), i.e. a string without colons. The Garden of Eden style is the basic modeling style considered in this paper; a transformation between different styles is possible.^2

^2 A student thesis addressing the conversion of different modeling styles into each other is in progress at our professorship; this is not covered in this paper.

3. CONCEPTUAL MODEL
In [7] the three layer architecture for dealing with XML Schema adaptions (i.e. the XML Schema evolution) was introduced and the correlations between the layers were mentioned. An overview is illustrated in figure 2.

[Figure 2: Three Layer Architecture. Layer EMX: an operation turns EMX into EMX'. Layer XML Schema: an operation turns XSD into XSD'. Layer XML: an operation turns XML into XML'. A 1 - 1 mapping connects the EMX and XSD layers, a 1 - * mapping connects the XSD and XML layers.]

The first layer is our conceptual model EMX (Entity Model for XML-Schema), a simplified representation of the second layer. This second layer is the XML Schema (XSD); a unique mapping exists between these two layers. The mapping is one of the main aspects of this paper (see section 3.2). The third layer consists of XML documents or instances; an ambiguous mapping between XSD and XML documents exists. It is ambiguous because of the optionality of structures (e.g. minOccurs = '0'; use = 'optional') or content types (e.g. <choice>). The third layer and the mapping between layers two and three, as well as the operations for transforming the different layers, are not covered in this paper (parts were published in [7]).

3.1 Formal Definition
The conceptual model EMX is a triplet of nodes ($N_M$), directed edges between nodes ($E_M$) and features ($F_M$).

$EMX = (N_M, E_M, F_M)$    (1)

Nodes are separated into simple types (st), complex types (ct), elements, attribute-groups, groups (e.g. content model), annotations, constraints and modules (i.e. externally managed XML Schemas). Taking the element information item of the corresponding XSD into account, every node has different attributes, e.g. an element node has a name, occurrence values, type information, etc. One of the most important
attributes of every node is the EID (EMX ID), a unique identification value for referencing and localization of every node; an EID is one-to-one in every EMX. The directed edges are defined between nodes by using the EIDs, i.e. every edge is a pair of EID values from a source to a target. The direction defines the include property, which was specified under consideration of the possibilities of an XML Schema. For example, if a model-group of the abstract data model (i.e. an EMX group with "EID = 1") contains different elements (e.g. EID = {2,3}), then two edges exist: (1,2) and (1,3). In section 3.3 further details about allowed edges are specified (see also figure 5). The additional features allow the user-specific setting of the overall process of co-evolution. It is not only possible to specify default values but also to configure the general behaviour of operations (e.g. only capacity-preserving operations are allowed). Furthermore all XML Schema properties of the element information item <schema> are included in the additional features. The additional features are not covered in this paper.

3.2 Mapping between XSD and EMX
An overview of the components of an XSD has been given in table 1. In the following section the unique mapping between these XSD components and the EMX nodes introduced in section 3.1 is specified. Table 2 summarizes the mapping. For every element information item (EII) the corresponding EMX node is given as well as the assigned visualization.

EII                                                  EMX Node          Visualization
<element>                                            element           (icon)
<attribute>, <attributeGroup>                        attribute-group   (icon)
<all>, <choice>, <sequence>, <any>, <anyAttribute>   group             (icon: blue "triangle with a G")
<simpleType>                                         st                implicit and specifiable
<complexType>                                        ct                implicit and derived
<include>, <import>, <redefine>, <override>          module            (icon)
<annotation>                                         annotation        (icon)
<key>, <unique>, <keyref>                            constraint        (icon)
<assert>                                             constraint        implicit in ct
<assertion>                                          constraint        restriction in st
<schema>                                             -                 the EMX itself

Table 2: Mapping and Visualization

For example, an EMX node group represents the abstract data model (ADM) node model-group (see table 1). This ADM node is visualized through the EII content models <all>, <choice> and <sequence>, and the wildcards <any> and <anyAttribute>. In an EMX the visualization of a group is the blue "triangle with a G" in it. Furthermore, if a group contains an element wildcard then this is visualized by adding a "blue W in a circle"; a similar behaviour takes place if an attribute wildcard is given in an <attributeGroup>.

The type-definitions are not directly visualized in an EMX. Simple types, for example, can be specified and afterwards be referenced by elements or attributes^3 by using the EID of the corresponding EMX node. The complex type is also implicitly given; the type will be automatically derived from the structure of the EMX after finishing the modeling process. The XML Schema specification 1.1 has introduced different logical constraints, which are also integrated in the EMX. These are the EIIs <assert> (for constraints on complex types) and <assertion>. According to the specification, an <assertion> is a facet of a restricted simple type [4]. The last EII is <schema>; this "root" is an EMX itself. This is also the reason why further information or properties of an XML Schema are stored in the additional features as mentioned above.

^3 The EII <attribute> and <attributeGroup> are the same in the EMX; an attribute-group is always a container.

3.3 Logical Structure
After introducing the conceptual model and specifying the mapping between an EMX and XSD, in the following section details about the logical structure (i.e. the storing model) are given. Details about the valid edges of an EMX are illustrated as well. Figure 3 gives an overview of the different relations used as well as the relationships between them.

[Figure 3: Logical Structure of an EMX. Relations shown: Schema, Element, Element_Ref, Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref, Group, Wildcard, ST, ST_List, Facet, CT, Annotation, Constraint, Path, Assert, Module. Legend: relation, EMX node, visualized, externally defined, parent_EID, has_a, same node in EMX.]

The logical structure is the direct consequence of the modeling style used, Garden of Eden; e.g. elements are either element declarations or element references. That is why this separation is also made in the EMX.

All in all there are 18 relations, which store the content of an XML Schema and form the base of an EMX. The different nodes reference each other by using the well-known foreign key constraints of relational databases. This is expressed by the directed "parent_EID" arrows, e.g. the EMX nodes ("rectangle with thick line") element, st, ct, attribute-group and module reference the "Schema" itself. If declarations or definitions are externally defined, then the "parent_EID" is the EID of the corresponding module ("blue arrow"). The "Schema" relation is an EMX, or rather the root of an EMX, as already mentioned above.

The "Annotation" relation can reference every other relation according to the XML Schema specification. Wildcards are realized either as an element wildcard, which belongs to a "Group" (i.e. EII <any>), or as an attribute wildcard, which belongs to a "CT" or "Attribute_Gr" (i.e. EII <anyAttribute>). Every "Element" relation (i.e. element declaration) has either a simple type or a complex type, and every "Element_Ref" relation has an element declaration. Attributes and attribute-groups are the same in an EMX, as mentioned above.

Moreover, figure 3 illustrates the distinction between visualized ("yellow border") and not visualized relations. Under consideration of table 2, six relations are directly visible in an EMX: constraints, annotations, modules, groups and, because of the Garden of Eden style, element references and attribute-group references. Table 3 summarizes which relation of figure 3 belongs to which EMX node of table 2.

EMX Node          Relation
element           Element, Element_Ref
attribute-group   Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref
group             Group, Wildcard
st                ST, ST_List, Facet
ct                CT
annotation        Annotation
constraint        Constraint, Path, Assert
module            Module

Table 3: EMX Nodes with Logical Structure

The EMX node st (i.e. simple type) has three relations. These are the relation "ST" for most simple types, the relation "ST_List" for the set-free storing of simple union types and the relation "Facet" for storing facets of a simple restriction type. Constraints are realized through the relation "Path" for storing all used XPath statements for the element information items (EII) <key>, <unique> and <keyref>, and the relation "Constraint" for general properties, e.g. name, XML Schema id, visualization information, etc. Furthermore the relation "Assert" is used for storing logical constraints against complex types (i.e. EII <assert>) and simple types (i.e. EII <assertion>). Figure 4 illustrates the stored information concerning the EMX node element, i.e. the relations "Element" and "Element_Ref".

element              element_ref
PK  EID              PK  EID
    name             FK  ref_EID
FK  type_EID             minOccurs
    finalV               maxOccurs
    defaultV             position
    fixed                id
    nillable         FK  file_ID
    id               FK  parent_EID
    form                 width
FK  file_ID              height
FK  parent_EID           x_Pos
                         y_Pos

Figure 4: Relations of EMX Node element

Both relations have in common that every tuple is identified by the primary key EID. The EID is one-to-one in every EMX as mentioned above. The other attributes are specified under consideration of the XML Schema specification [4], e.g. an element declaration needs a "name" and a type ("type_EID" as a foreign key) as well as other optional values like the final ("finalV"), default ("defaultV"), "fixed", "nillable", XML Schema "id" or "form" value. Other EMX specific attributes are also given, e.g. the "file_ID" and the "parent_EID" (see figure 3). The element references have a "ref_EID", which is a foreign key to a given element declaration. Moreover, attributes of the occurrence ("minOccurs", "maxOccurs"), the "position" in a content model and the XML Schema "id" are stored. Element references are visualized in an EMX. That is why some values about the position in an EMX are stored, i.e. the coordinates ("x_Pos", "y_Pos") and the "width" and "height" of an EMX node. The same position attributes are given in every other visualized EMX node.

The edges of the formal definition of an EMX can be derived by knowing the logical structure and the visualization of an EMX. Figure 5 illustrates the allowed edges of EMX nodes.

[Figure 5: Allowed Edges of EMX Nodes. Matrix edge(X,Y) of source nodes X against target nodes Y over element, attribute-group, group, ct, st, annotation, constraint, module and schema; a cross marks a possible edge, with a separate marking for implicitly given edges.]

An edge is always a pair of EIDs, from a source ("X") to a target ("Y"). For example, it is possible to add an edge outgoing from an element node to an annotation, constraint, st or ct. A "black cross" in the figure defines a possible edge. If an EMX is visualized then not all EMX nodes are explicitly given, e.g. the type-definitions of the abstract data model (i.e. EMX nodes st, ct; see table 2). In this case the corresponding "black cross" has to be moved along the given "yellow arrow", i.e. an edge in an EMX between a ct (source) and an attribute-group (target) is valid. If this EMX is visualized, then the attribute-group is shown as a child of the group which belongs to the above-mentioned ct. Some information is just "implicitly given" in a visualization of an EMX (e.g. simple types). A "yellow arrow" which starts and ends in the same field is a hint for a union of different nodes into one node, e.g. if a group contains a wildcard then in the visualization only the group node is visible (extended with the "blue W"; see table 2).

4. EXAMPLE
In section 3 the conceptual model EMX was introduced. In the following section an example is given. Figure 6 illustrates an XML Schema in the Garden of Eden modeling style. An event is specified, which contains a place ("ort") and an id ("event-id"). Furthermore the integration of other attributes is possible, because of an attribute wildcard in the respective complex type. The place is a sequence of a name and a date ("datum").

[Figure 6: XML Schema in Garden of Eden Style]

All type definitions (NCNAMEs: "orttype", "eventtype") and declarations (NCNAMEs: "event", "name", "datum", "ort" and the attribute "event-id") are globally specified. The target namespace is "eve", so the QNAME of e.g. the complex type definition "orttype" is "eve:orttype". By using the QNAME every above-mentioned definition and declaration can be referenced, so the re-usability of all components is given. Furthermore an attribute wildcard is also specified, i.e. the complex type "eventtype" contains, apart from the content model sequence and the attribute reference "eve:event-id", the element information item <anyAttribute>.

Figure 7 is the corresponding EMX of the above specified XML Schema.

[Figure 7: EMX to XSD of Figure 6]

The representation is an obvious simplification; it contains just eight well arranged EMX nodes. These are the elements "event", "ort", "name" and "datum", an annotation as a child of "event", the groups as children of "event" and "ort", as well as an attribute-group with wildcard. The simple types of the element references "name" and "datum" are implicitly given and not visualized. The complex types can be derived by identifying the elements which have no specified simple type but groups as a child (i.e. "event" and "ort").

Under consideration of figure 5, the edges are pairs of not visualized, internally defined EIDs. The source is the side of the connection without "black rectangle", the target is the other side. For example, the given annotation is a child of the element "event" and not the other way round; an element can never be a child of an annotation, neither in the XML Schema specification nor in the EMX.

The logical structure of the EMX of figure 7 is illustrated in figure 8. The relations of the EMX nodes are given as well as the attributes and corresponding values relevant for the example. Next to every tuple of the relations "Element_Ref" and "Group", small hints indicating which tuples are defined are added (to increase readability). It is obvious that an EID has to be unique; this is a prerequisite for the logical structure. An EID is created automatically; a user of the EMX can neither influence nor manipulate it.

Schema
EID   xmlns_xs                           targetName    TNPrefix
1     http://www.w3.org/2001/XMLSchema   gvd2013.xsd   eve

Element
EID   name    type_EID   parent_EID
2     event   14         1
3     name    11         1
4     datum   12         1
5     ort     13         1

Element_Ref
EID   ref_EID   minOccurs   maxOccurs   parent_EID   x_Pos   y_Pos   (hint)
6     2         1           1           1            75      75      event
7     3         1           1           16           60      175     name
8     4         1           1           16           150     175     datum
9     5         1           1           15           100     125     ort

Annotation
EID   parent_EID   x_Pos   y_Pos
10    2            50      100

ST
EID   name        mode       parent_EID
11    xs:string   built-in   1
12    xs:date     built-in   1

CT
EID   name        parent_EID
13    orttype     1
14    eventtype   1

Group
EID   mode       parent_EID   x_Pos   y_Pos   (hint)
15    sequence   14           125     100     eventsequence
16    sequence   13           100     150     ortsequence

Wildcard
EID   parent_EID
17    14

Attribute
EID   name       parent_EID
18    event-id   1

Attribute_Ref
EID   ref_EID   parent_EID
19    18        14

Attribute_Gr
EID   parent_EID
20    1

Attribute_Gr_Ref
EID   ref_EID   parent_EID   x_Pos   y_Pos
21    20        14           185     125

Figure 8: Logical Structure of Figure 7

The element references contain information about the occurrence ("minOccurs", "maxOccurs"), which is not explicitly given in the XSD of figure 6. The XML Schema specification defines default values in such cases. If an element reference does not specify the occurrence values then the standard value "1" is used; an element reference is obligatory. These default values are also added automatically.

The stored names of element declarations are NCNAMEs, but by knowing the target namespace of the corresponding schema (i.e. "eve") the QNAME can be derived. The name of a type definition is also the NCNAME, but if e.g. a built-in type is specified then the name is the QNAME of the XML Schema specification ("xs:string", "xs:date").

5. PRACTICAL USE OF EMX
The co-evolution of XML documents was already mentioned in section 1. At the University of Rostock a research prototype for dealing with this co-evolution was developed: CodeX (Conceptual design and evolution for XML Schema) [5]. The idea behind it is simple and straightforward at the same time: take an XML Schema, transform it to the specifically developed conceptual model (EMX - Entity Model for XML-Schema), change the simplified conceptual model instead of dealing with the whole complexity of XML Schema, collect this change information (i.e. the user interaction with EMX) and use it to automatically create transformation steps for adapting the XML documents (by using XSLT - Extensible Stylesheet Language Transformations [1]). The mapping between EMX and XSD is unique, so it is possible to describe modifications not only on the EMX but also on the XSD. The transformation and logging language ELaX (Evolution Language for XML-Schema [8]) is used to unify the internally collected information as well as to introduce an interface for dealing directly with XML Schema. Figure 9 illustrates the component model of CodeX, first published in [7] but now extended with the ELaX interface.

[Figure 9: Component Model of CodeX [5]. Components shown: GUI (schema modifications, ELaX, data supply), visualization, evolution engine, import (XSD, Config, XML), export (XSD, Config, XML, XSLT, results), ELaX interface, and a knowledge base (model mapping, specification of operation, configuration, XML documents, update notes and evolution results; model data, evolution specific data, transformation, log).]

The component model illustrates the different parts for dealing with the co-evolution. The main parts are an import and export component for collecting and providing data of e.g. a user (XML Schemas, configuration files, XML document collections, XSLT files), and a knowledge base for storing information (model data, evolution specific data and co-evolution results) and especially the logged ELaX statements ("Log"). The mapping information between XSD and EMX of table 2 is specified in the "Model data" component.

Furthermore the CodeX prototype also provides a graphical user interface ("GUI"), a visualization component for the conceptual model and an evolution engine, in which the transformations are derived. The visualization component realizes the visualization of an EMX introduced in table 2. The ELaX interface for modifying imported XML Schemas communicates directly with the evolution engine.

6. CONCLUSION
Valid XML documents need e.g. an XML Schema, which restricts the possibilities and usage of declarations, definitions and structures in general. In a heterogeneous, changing environment (e.g. an information exchange scenario), even "old" and long-used XML Schemas have to be modified to meet new requirements and to stay up-to-date.

EMX (Entity Model for XML-Schema) as a conceptual model is a simplified representation of an XSD, which hides its complexity and offers a graphical presentation. A unique mapping exists between every XSD modeled in the Garden of Eden style and an EMX, so it is possible to adapt or modify the conceptual model in place of the XML Schema.

This article presents the formal definition of an EMX: all in all there are different nodes, which are connected by directed edges. The abstract data model and the element information item of the XML Schema specification were taken into account, and the allowed edges are likewise specified according to the specification. In general the most important components of an XSD are represented in an EMX, e.g. elements, attributes, simple types, complex types, annotations, constraints, model groups and group definitions. Furthermore the logical structure is presented, which defines not only the underlying storing relations but also the relationships between them. The visualization of an EMX is also defined: starting from 18 relations in the logical structure, there are eight EMX nodes in the conceptual model, of which six are visualized.

Our conceptual model is an essential prerequisite for the prototype CodeX (Conceptual design and evolution for XML Schema) as well as for the above-mentioned co-evolution. A remaining step is the finalization of the implementation in CodeX. After this work an evaluation of the usability of the conceptual model is planned. Nevertheless we are confident that the usage is straightforward and that the simplification offered by EMX, compared with dealing with the whole complexity of an XML Schema itself, is huge.

7. REFERENCES
[1] XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 26-March-2013.
[2] Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/2008/REC-xml-20081126/, November 2008. Online; accessed 26-March-2013.
[3] XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition). http://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/, December 2010. Online; accessed 26-March-2013.
[4] W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/, April 2012. Online; accessed 26-March-2013.
[5] M. Klettke. Conceptual XML Schema Evolution - the CoDEX Approach for Design and Redesign. In BTW Workshops, pages 53–63, 2007.
[6] E. Maler. Schema design rules for ubl...and maybe for you. In XML 2002 Proceedings by deepX, 2002.
[7] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. In Grundlagen von Datenbanken, pages 29–34, 2012.
[8] T. Nösinger, M. Klettke, and A. Heuer. Automatisierte Modelladaptionen durch Evolution - (R)ELaX in the Garden of Eden. Technical Report CS-01-13, Institut für Informatik, Universität Rostock, Rostock, Germany, Jan. 2013. Published as technical report CS-01-13 under ISSN 0944-5900.
[9] E. van der Vlist. XML Schema. O'Reilly & Associates, Inc., 2002.
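
As an illustration of the logical structure from section 3.3, the tuples of figure 8 can be written down as plain Python data and checked against the properties the text states: EIDs are unique, foreign keys point at existing tuples, missing occurrence values default to 1, and QNAMEs follow from the target namespace prefix. This is only a minimal sketch under stated assumptions. The dictionary layout and the helper names (`duplicate_eids`, `dangling_references`, `parent_edges`, `with_default_occurrence`, `qname`) are hypothetical and are not part of CodeX, and reading each "parent_EID" link as a (source, target) pair is one plausible way to obtain edges, not necessarily the derivation used by the authors.

```python
from collections import Counter

# Tuples copied from figure 8; layout coordinates are omitted for brevity.
# The dictionary layout itself is an assumption made for this sketch.
RELATIONS = {
    "Schema": [{"EID": 1, "TNPrefix": "eve"}],
    "Element": [
        {"EID": 2, "name": "event", "type_EID": 14, "parent_EID": 1},
        {"EID": 3, "name": "name", "type_EID": 11, "parent_EID": 1},
        {"EID": 4, "name": "datum", "type_EID": 12, "parent_EID": 1},
        {"EID": 5, "name": "ort", "type_EID": 13, "parent_EID": 1},
    ],
    "Element_Ref": [
        {"EID": 6, "ref_EID": 2, "parent_EID": 1},
        {"EID": 7, "ref_EID": 3, "parent_EID": 16},
        {"EID": 8, "ref_EID": 4, "parent_EID": 16},
        {"EID": 9, "ref_EID": 5, "parent_EID": 15},
    ],
    "Annotation": [{"EID": 10, "parent_EID": 2}],
    "ST": [
        {"EID": 11, "name": "xs:string", "parent_EID": 1},
        {"EID": 12, "name": "xs:date", "parent_EID": 1},
    ],
    "CT": [
        {"EID": 13, "name": "orttype", "parent_EID": 1},
        {"EID": 14, "name": "eventtype", "parent_EID": 1},
    ],
    "Group": [
        {"EID": 15, "mode": "sequence", "parent_EID": 14},
        {"EID": 16, "mode": "sequence", "parent_EID": 13},
    ],
    "Wildcard": [{"EID": 17, "parent_EID": 14}],
    "Attribute": [{"EID": 18, "name": "event-id", "parent_EID": 1}],
    "Attribute_Ref": [{"EID": 19, "ref_EID": 18, "parent_EID": 14}],
    "Attribute_Gr": [{"EID": 20, "parent_EID": 1}],
    "Attribute_Gr_Ref": [{"EID": 21, "ref_EID": 20, "parent_EID": 14}],
}

FOREIGN_KEYS = ("parent_EID", "type_EID", "ref_EID")


def all_rows():
    """Yield (relation name, tuple) for every stored tuple."""
    for relation, rows in RELATIONS.items():
        for row in rows:
            yield relation, row


def duplicate_eids():
    """Return EIDs used more than once; an EMX requires this to be empty."""
    counts = Counter(row["EID"] for _, row in all_rows())
    return sorted(eid for eid, n in counts.items() if n > 1)


def dangling_references():
    """Return (relation, EID, key) for foreign keys that point at no tuple."""
    known = {row["EID"] for _, row in all_rows()}
    return [
        (relation, row["EID"], key)
        for relation, row in all_rows()
        for key in FOREIGN_KEYS
        if key in row and row[key] not in known
    ]


def parent_edges():
    """Read every parent_EID link as a (source, target) pair of EIDs."""
    return sorted(
        (row["parent_EID"], row["EID"])
        for _, row in all_rows()
        if "parent_EID" in row
    )


def with_default_occurrence(ref):
    """Fill in the XML Schema default of 1 for missing occurrence values."""
    return {"minOccurs": 1, "maxOccurs": 1, **ref}


def qname(name, prefix):
    """Derive the QNAME; names that already carry a prefix are kept as-is."""
    return name if ":" in name else f"{prefix}:{name}"


if __name__ == "__main__":
    prefix = RELATIONS["Schema"][0]["TNPrefix"]
    print("duplicate EIDs:", duplicate_eids())
    print("dangling references:", dangling_references())
    print("children of group 16:", [t for s, t in parent_edges() if s == 16])
    print("QNAME of orttype:", qname("orttype", prefix))
```

Keeping every relation as a list of flat tuples mirrors the relational storing model of figure 3, so the same checks could equally be expressed as primary key and foreign key constraints in a relational database.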