
A Conceptual Model for the XML Schema Evolution

Thomas Nösinger

Meike Klettke

Andreas Heuer

Database Research Group, University of Rostock, Germany

meike@informatik.uni-rostock.de

2013

In this article the conceptual model EMX (Entity Model for XML-Schema) for dealing with the evolution of XML Schema (XSD) is introduced. The model is a simplified representation of an XSD, which hides the complexity of XSD and offers a graphical presentation. For this purpose a unique mapping is necessary, which is presented together with further information about the visualization and the logical structure. A small example illustrates the relationships between an XSD and an EMX. Finally, the integration into a research prototype developed for dealing with the co-evolution of corresponding XML documents is presented.

1. INTRODUCTION

The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To ensure that well-defined XML documents can be understood by every participant (e.g. user or application), it is necessary to introduce a document description, which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. An XML document is called valid if it fulfills all restrictions and conditions of an associated XML Schema.

XML Schemas that have been used for years have to be modified from time to time. The main reason is that the requirements for exchanged information can change. To meet these requirements the schema has to be adapted, for example if additional elements are added to an existing content model, the data type of information changes or integrity constraints are introduced. All in all, every possible structure of an XML Schema definition (XSD) can be changed. A question arises: how can somebody make these adaptions without being forced to understand and deal with the whole complexity of an XSD? One solution is the definition of a conceptual model that simplifies the base model; in this paper we outline further details of our conceptual model called EMX (Entity Model for XML-Schema).

A further issue, not covered in this paper but important in the overall context of exchanging information, is the validity of XML documents [5]. Modifications of an XML Schema require adaptions of all XML documents that are valid against the former XML Schema (also known as co-evolution).

One impractical way to handle this problem is to introduce different versions of an XML Schema, but in this case all versions have to be stored and every participant of the heterogeneous environment has to understand all the different document descriptions. An alternative solution is the evolution of the XML Schema, so that just one document description exists at any one time. This does not solve the above-mentioned validity problem of XML documents, but with a standardized description of the adaptions (e.g. a sequence of operations [8]) and a conceptual model including the corresponding mapping to the base model (e.g. XSD), it is possible to derive the necessary XML document transformation steps automatically from the adaptions [7]. The conceptual model is an essential prerequisite for the process of XML Schema evolution, which is touched on here but not handled in detail.

This paper is organized as follows. Section 2 gives the necessary background of XML Schema and corresponding concepts. Section 3 presents our conceptual model by first giving a formal definition (3.1), followed by the specification of the unique mapping between EMX and XSD (3.2) and the logical structure of the EMX (3.3). After introducing the conceptual model we present an example of an EMX in section 4. In section 5 we describe the practical use of EMX in our prototype, which was developed to handle the co-evolution of XML Schema and XML documents. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND

In this section we present a common notation used in the rest of the paper. First we briefly introduce the abstract data model (ADM) and the element information item (EII) of XML Schema, before further details concerning different modeling styles are given.

The XML Schema abstract data model consists of different components or node types¹. Basically these are: type definition components (simple and complex types), declaration components (elements and attributes), model group components, constraint components, group definition components and annotation components [3]. Additionally the element information item exists, an XML representation of these components. The element information item defines which content and attributes can be used in an XML Schema. Table 1 gives an overview of the most important components and their concrete representation.

¹ An XML Schema can be visualized as a directed graph with different nodes (components); an edge realizes the hierarchy.

[Table 1: Components of the XML Schema abstract data model (ADM: declarations, group-definitions, model-groups, type-definitions, annotations, constraints) and the element information items that represent them.]

The <include>, <import>, <redefine> and <override> are not explicitly given in the abstract data model (N.N. - Not Named), but they are important components for embedding externally defined XML Schema (esp. element declarations, attribute declarations and type definitions). In the rest of the paper, we will summarize them under the node type "module". The <schema> "is the document (root) element of any W3C XML Schema. It's both a container for all the declarations and definitions of the schema and a place holder for a number of default values expressed as attributes" [9]. Analyzing the possibilities of specifying declarations and definitions leads to four different modeling styles of XML Schema: Russian Doll, Salami Slice, Venetian Blind and Garden of Eden [6]. These modeling styles mainly influence the reusability of element declarations or defined data types and also the flexibility of an XML Schema in general. Figure 1 summarizes the modeling styles with their scopes.

[Figure 1: The four modeling styles and the scope (local or global) of their element and attribute declarations and of their type definitions.]

The scope of element and attribute declarations as well as the scope of type definitions is global iff the corresponding node is specified as a child of the <schema> and can be referenced (by knowing e.g. the name and namespace). Locally specified nodes, in contrast, are not directly under <schema>, so they cannot be re-used.
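The relationship between scope and modeling style can be sketched as a small lookup. The assignment of scopes to styles below follows the common characterization of the four styles in the XML Schema literature and is not read off figure 1; the names `Scope`, `MODELING_STYLES` and `modeling_style` are hypothetical.

```python
from typing import NamedTuple


class Scope(NamedTuple):
    """Scope of the components of an XSD (True = global, i.e. child of <schema>)."""
    declarations_global: bool  # element and attribute declarations
    types_global: bool         # simple and complex type definitions


# Common characterization of the four styles (an assumption of this sketch).
MODELING_STYLES = {
    Scope(declarations_global=False, types_global=False): "Russian Doll",
    Scope(declarations_global=True, types_global=False): "Salami Slice",
    Scope(declarations_global=False, types_global=True): "Venetian Blind",
    Scope(declarations_global=True, types_global=True): "Garden of Eden",
}


def modeling_style(declarations_global: bool, types_global: bool) -> str:
    """Return the name of the modeling style for a combination of scopes."""
    return MODELING_STYLES[Scope(declarations_global, types_global)]


if __name__ == "__main__":
    print(modeling_style(True, True))
```

Only the combination in which both declarations and type definitions are global yields the Garden of Eden style, which is why every component of such a schema can be referenced by name.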

An XML Schema in the Garden of Eden style contains only global declarations and definitions. If the requirements for exchanged information change and the underlying schema has to be adapted, then this modeling style is the most suitable. The advantage of the Garden of Eden style is that all components can be easily identified by knowing the QNAME (qualified name). Furthermore the position of components within an XML Schema is obvious. A qualified name is a colon-separated string of the target namespace of the XML Schema followed by the name of the declaration or definition. The name of a declaration or definition is a string of the data type NCNAME (non-colonized name), a string without colons. The Garden of Eden style is the basic modeling style considered in this paper; a transformation between different styles is possible.²

3. CONCEPTUAL MODEL

In [7] the three-layer architecture for dealing with XML Schema adaptions (i.e. the XML Schema evolution) was introduced and the correlations between the layers were mentioned. An overview is illustrated in figure 2.

[Figure 2: Three-layer architecture. EMX and XSD are connected by a 1 - 1 mapping, XSD and XML by a 1 - * mapping; an operation on each layer leads from EMX, XSD and XML to EMX', XSD' and XML'.]

The first layer is our conceptual model EMX (Entity Model for XML-Schema), a simplified representation of the second layer. This second layer is the XML Schema (XSD); a unique mapping exists between these two layers. The mapping is one of the main aspects of this paper (see section 3.2). The third layer consists of XML documents or instances; an ambiguous mapping between XSD and XML documents exists. It is ambiguous because of the optionality of structures (e.g. minOccurs = '0'; use = 'optional') or content types (e.g. <choice>). The third layer and the mapping between layers two and three, as well as the operations for transforming the different layers, are not covered in this paper (parts were published in [7]).

² A student thesis addressing the conversion of different modeling styles into each other is in progress at our professorship; this is not covered in this paper.

3.1 Formal Definition

[...] attributes of every node is the EID (EMX ID), a unique identification value for referencing and locating every node; an EID is unique within every EMX. The directed edges are defined between nodes by using the EIDs, i.e. every edge is a pair of EID values from a source to a target. The direction defines the include property, which was specified under consideration of the possibilities of an XML Schema. For example, if a model-group of the abstract data model (i.e. an EMX group with "EID = 1") contains different elements (e.g. EID = {2,3}), then two edges exist: (1,2) and (1,3). In section 3.3 further details about allowed edges are specified (see also figure 5).

The additional features allow the user-specific setting of the overall process of co-evolution. It is not only possible to specify default values but also to configure the general behaviour of operations (e.g. only capacity-preserving operations are allowed). Furthermore all XML Schema properties of the element information item <schema> are included in the additional features. The additional features are not covered in this paper.

3.2 Mapping between XSD and EMX

An overview of the components of an XSD has been given in table 1. In the following section the unique mapping between these XSD components and the EMX nodes introduced in section 3.1 is specified. Table 2 summarizes the mapping. For every element information item (EII) the corresponding EMX node is given as well as the assigned visualization. For example an EMX node group represents the abstract data model (ADM) node model-group (see table 1). This ADM node is visualized through the EII content models <all>, <choice> and <sequence>, and the wildcards <any> and <anyAttribute>. In an EMX the visualization of a group is the blue "triangle with a G" in it. Furthermore, if a group contains an element wildcard then this is visualized by adding a "blue W in a circle"; a similar behaviour takes place if an attribute wildcard is given in an <attributeGroup>.

Table 2: Mapping between XSD element information items (EII) and EMX nodes, with the assigned visualization

EII                                                   EMX node            Visualization
<element>                                             element             graphical symbol
<attribute>, <attributeGroup>                         attribute-group     graphical symbol
<all>, <choice>, <sequence>, <any>, <anyAttribute>    group               blue triangle with a "G"; a wildcard adds a blue "W" in a circle
<simpleType>                                          st                  implicit and specifiable
<complexType>                                         ct                  implicit and derived
<include>, <import>, <redefine>, <override>           module              graphical symbol
<annotation>                                          annotation          graphical symbol
<key>, <unique>, <keyref>                             constraint          graphical symbol
<assert>                                              implicit in ct
<assertion>                                           restriction in st
<schema>                                              the EMX itself

The type-definitions are not directly visualized in an EMX. Simple types for example can be specified and afterwards be referenced by elements or attributes³ by using the EID of the corresponding EMX node. The complex type is also implicitly given; the type is automatically derived from the structure of the EMX after the modeling process is finished. The XML Schema specification 1.1 has introduced different logical constraints, which are also integrated in the EMX. These are the EIIs <assert> (for constraints on complex types) and <assertion>. According to the specification, an <assertion> is a facet of a restricted simple type [4]. The last EII is <schema>; this "root" is an EMX itself. This is also the reason why further information or properties of an XML Schema are stored in the additional features as mentioned above.

³ The EII <attribute> and <attributeGroup> are the same in the EMX; an attribute-group is always a container.

3.3 Logical Structure

After introducing the conceptual model and specifying the mapping between an EMX and XSD, in the following section details about the logical structure (i.e. the storing model) are given. Details about the valid edges of an EMX are illustrated as well. Figure 3 gives an overview of the different relations used as well as the relationships between them.

[Figure 3: Logical structure of an EMX, showing the relations Schema, Element, Element_Ref, Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref, Group, Wildcard, ST, ST_List, Facet, CT, Annotation, Constraint, Path, Assert and Module, connected by "parent_EID" and "has_a" arrows; the legend distinguishes relations, EMX nodes and relations visualized in an EMX.]

The logical structure is the direct consequence of the used modeling style Garden of Eden, e.g. elements are either element declarations or element references. That is why this separation is also made in the EMX.

All in all there are 18 relations, which store the content of an XML Schema and form the base of an EMX. The different nodes reference each other by using the well-known foreign key constraints of relational databases. This is expressed by using the directed "parent_EID" arrows, e.g. the EMX nodes ("rectangle with thick line") element, st, ct, attribute-group and module reference the "Schema" itself. If declarations or definitions are externally defined then the "parent_EID" is the EID of the corresponding module ("blue arrow"). The "Schema" relation is an EMX, or more precisely the root of an EMX, as already mentioned above.

The "Annotation" relation can reference every other relation according to the XML Schema specification. Wildcards are realized as an element wildcard, which belongs to a "Group" (i.e. EII <any>), or they can be attribute wildcards, which belong to a "CT" or "Attribute_Gr" (i.e. EII <anyAttribute>). Every "Element" relation (i.e. element declaration) has either a simple type or a complex type, and every "Element_Ref" relation has an element declaration. Attributes and attribute-groups are the same in an EMX, as mentioned above.

Moreover figure 3 illustrates the distinction between visualized ("yellow border") and not visualized relations. According to table 2, six relations are directly visible in an EMX: constraints, annotations, modules, groups and, because of the Garden of Eden style, element references and attribute-group references. Table 3 summarizes which relation of figure 3 belongs to which EMX node of table 2.
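The mapping from element information items to EMX nodes described in section 3.2 can be written down as a plain dictionary. This is only a sketch of the mapping as stated in the text; the names `EII_TO_EMX_NODE`, `SPECIAL_CASES`, `emx_node_for` and `eiis_by_node` are hypothetical and not part of the prototype.

```python
from collections import defaultdict

# EII name (without angle brackets) -> EMX node, as described in section 3.2.
EII_TO_EMX_NODE = {
    "element": "element",
    "attribute": "attribute-group",
    "attributeGroup": "attribute-group",
    "all": "group",
    "choice": "group",
    "sequence": "group",
    "any": "group",
    "anyAttribute": "group",
    "simpleType": "st",
    "complexType": "ct",
    "include": "module",
    "import": "module",
    "redefine": "module",
    "override": "module",
    "annotation": "annotation",
    "key": "constraint",
    "unique": "constraint",
    "keyref": "constraint",
}

# EIIs without an EMX node of their own.
SPECIAL_CASES = {
    "assert": "implicit in ct",
    "assertion": "restriction in st",
    "schema": "the EMX itself",
}


def emx_node_for(eii: str) -> str:
    """Return the EMX node (or special case) for an EII such as '<choice>'."""
    name = eii.strip("<>")
    if name in SPECIAL_CASES:
        return SPECIAL_CASES[name]
    return EII_TO_EMX_NODE[name]


def eiis_by_node() -> dict[str, list[str]]:
    """Group the EIIs by the EMX node that represents them."""
    grouped = defaultdict(list)
    for eii, node in EII_TO_EMX_NODE.items():
        grouped[node].append(eii)
    return dict(grouped)


if __name__ == "__main__":
    print(emx_node_for("<choice>"))
    print(sorted(eiis_by_node()))
```

Grouping the dictionary by its values yields exactly the eight EMX nodes of the conceptual model.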

Table 3: EMX nodes of table 2 and the relations of figure 3 that belong to them

EMX node           Relation
element            Element, Element_Ref
attribute-group    Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref
group              Group, Wildcard
st                 ST, ST_List, Facet
ct                 CT
annotation         Annotation
constraint         Constraint, Path, Assert
module             Module

The EMX node st (i.e. simple type) has three relations. These are the relation "ST" for most simple types, the relation "ST_List" for the set-free storage of simple union types and the relation "Facet" for storing facets of a simple restriction type. Constraints are realized through the relation "Path" for storing all XPath statements used for the element information items (EII) <key>, <unique> and <keyref>, and the relation "Constraint" for general properties, e.g. name, XML Schema id, visualization information, etc. Furthermore the relation "Assert" is used for storing logical constraints on complex types (i.e. EII <assert>) and simple types (i.e. EII <assertion>).

Figure 4 illustrates the stored information concerning the EMX node element, i.e. the relations "Element" and "Element_Ref". Both relations have in common that every tuple is identified by the primary key EID. The EID is unique within every EMX as mentioned above. The other attributes are specified according to the XML Schema specification [4], e.g. an element declaration needs a "name" and a type ("type_EID" as a foreign key) as well as other optional values like the final ("finalV"), default ("defaultV"), "fixed", "nillable", XML Schema "id" or "form" value. Other EMX-specific attributes are also given, e.g. the "file_ID" and the "parent_EID" (see figure 3). The element references have a "ref_EID", which is a foreign key to a given element declaration. Moreover attributes of the occurrence ("minOccurs", "maxOccurs"), the "position" in a content model and the XML Schema "id" are stored. Element references are visualized in an EMX. That is why some values about the position in an EMX are stored, i.e. the coordinates ("x_Pos", "y_Pos") and the "width" and "height" of an EMX node. The same position attributes are given in every other visualized EMX node.
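The relations "Element" and "Element_Ref" and the foreign key between them can be sketched with an in-memory SQLite database. The column names follow the attributes named in the text; the SQL types, the choice of SQLite and the helper names `create_store` and `rejects_dangling_reference` are assumptions of this sketch, not the storage layer of the prototype.

```python
import sqlite3

# Column names follow the text; the SQL types are assumptions.
DDL = """
CREATE TABLE Element (
    EID        INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    type_EID   INTEGER NOT NULL,   -- refers to a simple or a complex type
    finalV     TEXT,
    defaultV   TEXT,
    fixed      TEXT,
    nillable   INTEGER,
    id         TEXT,
    form       TEXT,
    file_ID    INTEGER,
    parent_EID INTEGER
);
CREATE TABLE Element_Ref (
    EID        INTEGER PRIMARY KEY,
    ref_EID    INTEGER NOT NULL REFERENCES Element(EID),
    minOccurs  TEXT,
    maxOccurs  TEXT,
    position   INTEGER,
    id         TEXT,
    x_Pos      INTEGER,
    y_Pos      INTEGER,
    width      INTEGER,
    height     INTEGER
);
"""


def create_store() -> sqlite3.Connection:
    """Create an empty in-memory store with both relations."""
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def rejects_dangling_reference(conn: sqlite3.Connection) -> bool:
    """Return True if a reference to a non-existent declaration is refused."""
    try:
        conn.execute(
            "INSERT INTO Element_Ref (EID, ref_EID) VALUES (?, ?)", (99, 12345)
        )
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    store = create_store()
    store.execute("INSERT INTO Element (EID, name, type_EID) VALUES (1, 'name', 7)")
    store.execute("INSERT INTO Element_Ref (EID, ref_EID, position) VALUES (2, 1, 1)")
    print(rejects_dangling_reference(store))
```

An element reference can only be stored if the referenced element declaration exists, which mirrors the statement that every "Element_Ref" relation has an element declaration.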

The edges of the formal definition of an EMX can be derived by knowing the logical structure and the visualization of an EMX. Figure 5 illustrates the allowed edges of EMX nodes. An edge is always a pair of EIDs, from a source ("X") to a target ("Y"). For example it is possible to add an edge outgoing from an element node to an annotation, constraint, st or ct. A "black cross" in the figure defines a possible edge. If an EMX is visualized then not all EMX nodes are explicitly given, e.g. the type-definitions of the abstract data model (i.e. EMX nodes st, ct; see table 2). In this case the corresponding "black cross" has to be moved along the given "yellow arrow", i.e. an edge in an EMX between a ct (source) and an attribute-group (target) is valid. If this EMX is visualized, then the attribute-group is shown as a child of the group which belongs to the above-mentioned ct. Some information is only "implicitly given" in a visualization of an EMX (e.g. simple types). A "yellow arrow" which starts and ends in the same field indicates a union of different nodes into one node, e.g. if a group contains a wildcard then in the visualization only the group node is visible (extended with the "blue W"; see table 2).

[Figure 5: Allowed edges edge(X,Y) of an EMX as a matrix of sources X (element, attribute-group, group, ct, st, annotation, constraint, module, schema) and targets Y (element, attribute-group, group, ct, st, annotation, constraint, module, implicitly given); a cross marks an allowed edge.]
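The notion of an edge as a pair of EIDs, checked against a set of allowed node-type combinations, can be sketched as follows. Only the combinations that the text states explicitly are listed, so `STATED_ALLOWED_EDGES` is deliberately incomplete compared with figure 5; the class `EMX` and its methods are hypothetical names for this sketch.

```python
from dataclasses import dataclass, field

# Only the combinations stated in the text (source type, target type);
# the full matrix of figure 5 contains more entries.
STATED_ALLOWED_EDGES = {
    ("element", "annotation"),
    ("element", "constraint"),
    ("element", "st"),
    ("element", "ct"),
    ("ct", "attribute-group"),
    ("group", "element"),  # from the example in section 3.1
}


@dataclass
class EMX:
    nodes: dict[int, str] = field(default_factory=dict)   # EID -> node type
    edges: set[tuple[int, int]] = field(default_factory=set)
    _next_eid: int = 1

    def add_node(self, node_type: str) -> int:
        """Create a node; the EID is generated, never chosen by the caller."""
        eid = self._next_eid
        self._next_eid += 1
        self.nodes[eid] = node_type
        return eid

    def add_edge(self, source: int, target: int) -> None:
        """Add the edge (source, target) if its node types may be connected."""
        kind = (self.nodes[source], self.nodes[target])
        if kind not in STATED_ALLOWED_EDGES:
            raise ValueError(f"edge {kind[0]} -> {kind[1]} is not allowed")
        self.edges.add((source, target))


if __name__ == "__main__":
    emx = EMX()
    group = emx.add_node("group")
    first = emx.add_node("element")
    second = emx.add_node("element")
    emx.add_edge(group, first)
    emx.add_edge(group, second)
    print(sorted(emx.edges))
```

With a group of EID 1 and two elements of EID 2 and 3, this reproduces the two edges (1,2) and (1,3) of the example in section 3.1, while an edge from an annotation to an element is refused.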

4. EXAMPLE

In section 3 the conceptual model EMX was introduced. In the following section an example is given. Figure 6 illustrates an XML Schema in the Garden of Eden modeling style. An event is specified, which contains a place ("ort") and an id ("event-id"). Furthermore the integration of other attributes is possible because of an attribute wildcard in the respective complex type. The place is a sequence of a name and a date ("datum").

All type definitions (NCNAMEs: "orttype", "eventtype") and declarations (NCNAMEs: "event", "name", "datum", "ort" and the attribute "event-id") are globally specified. The target namespace is "eve", so the QNAME of e.g. the complex type definition "orttype" is "eve:orttype". By using the QNAME every above-mentioned definition and declaration can be referenced, so the re-usability of all components is given. Furthermore an attribute wildcard is also specified, i.e. the complex type "eventtype" contains, apart from the content model sequence and the attribute reference "eve:event-id", the element information item <anyAttribute>.
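A schema matching this description can be reconstructed and checked for the Garden of Eden property with a few lines of Python. The XSD text below is rebuilt from the prose only and is not a copy of figure 6; the namespace URI bound to the prefix "eve" and the type of "event-id" are assumptions, and `is_garden_of_eden` and `global_names` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Reconstructed from the description in the text. Assumptions: the namespace
# URI "gvd2013.xsd" for the prefix "eve" and xs:string as type of "event-id".
EVENT_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:eve="gvd2013.xsd" targetNamespace="gvd2013.xsd">
  <xs:element name="event" type="eve:eventtype"/>
  <xs:element name="ort" type="eve:orttype"/>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="datum" type="xs:date"/>
  <xs:attribute name="event-id" type="xs:string"/>
  <xs:complexType name="orttype">
    <xs:sequence>
      <xs:element ref="eve:name"/>
      <xs:element ref="eve:datum"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="eventtype">
    <xs:sequence>
      <xs:element ref="eve:ort"/>
    </xs:sequence>
    <xs:attribute ref="eve:event-id"/>
    <xs:anyAttribute/>
  </xs:complexType>
</xs:schema>
"""


def is_garden_of_eden(xsd_text: str) -> bool:
    """True if no declaration or type definition is nested below a global one."""
    root = ET.fromstring(xsd_text.strip())
    for component in root:
        for node in component.iter():
            if node is component:
                continue
            local_declaration = (
                node.tag in (XS + "element", XS + "attribute") and "name" in node.attrib
            )
            local_type = node.tag in (XS + "simpleType", XS + "complexType")
            if local_declaration or local_type:
                return False
    return True


def global_names(xsd_text: str) -> list[str]:
    """Return the NCNAMEs of all named children of <schema>."""
    root = ET.fromstring(xsd_text.strip())
    return sorted(child.get("name") for child in root if child.get("name"))


if __name__ == "__main__":
    print(is_garden_of_eden(EVENT_XSD))
    print(global_names(EVENT_XSD))
```

Inside the content models only references (`ref`) occur, so every declaration and definition stays a direct child of <schema> and can be addressed by its QNAME.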

Figure 7 is the corresponding EMX of the above specified XML Schema. The representation is an obvious simplification; it contains just eight well-arranged EMX nodes. These are the elements "event", "ort", "name" and "datum", an annotation as a child of "event", the groups as children of "event" and "ort", as well as an attribute-group with wildcard. The simple types of the element references "name" and "datum" are implicitly given and not visualized. The complex types can be derived by identifying the elements which have no specified simple type but a group as a child (i.e. "event" and "ort").

The edges are, in accordance with figure 5, pairs of not visualized, internally defined EIDs. The source is the side of the connection without the "black rectangle", the target is the other side. For example the given annotation is a child of the element "event" and not the other way round; an element can never be a child of an annotation, neither in the XML Schema specification nor in the EMX.

The logical structure of the EMX of figure 7 is illustrated in figure 8. The relations of the EMX nodes are given as well as the attributes and corresponding values relevant for the example. Next to every tuple of the relations "Element_Ref" and "Group", small hints indicating which tuples are defined are added (to increase readability). It is obvious that an EID has to be unique; this is a prerequisite for the logical structure. An EID is created automatically; a user of the EMX can neither influence nor manipulate it.

[Figure 8: Logical structure of the EMX of figure 7. Among the tuples shown are the schema values xmlns_xs = http://www.w3.org/2001/XMLSchema, targetName = gvd2013.xsd and TNPrefix = eve, and an "Annotation" tuple with EID = 10, parent_EID = 2, x_Pos = 50 and y_Pos = 100.]

The element references contain information about the occurrence ("minOccurs", "maxOccurs"), which is not explicitly given in the XSD of figure 6. The XML Schema specification defines default values in such cases. If an element reference does not specify the occurrence values then the default value "1" is used, i.e. the element reference is obligatory. These default values are also added automatically.
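How the missing occurrence values could be filled in when an element reference is stored is sketched below. The function name `stored_occurrence` and the sample references are hypothetical; only the default value "1" for both attributes comes from the XML Schema specification.

```python
import xml.etree.ElementTree as ET

DEFAULT_OCCURS = "1"  # default of minOccurs and maxOccurs in XML Schema


def stored_occurrence(element_ref_xml: str) -> dict[str, str]:
    """Return the occurrence values to store, falling back to the defaults."""
    node = ET.fromstring(element_ref_xml)
    return {
        "minOccurs": node.get("minOccurs", DEFAULT_OCCURS),
        "maxOccurs": node.get("maxOccurs", DEFAULT_OCCURS),
    }


def is_obligatory(element_ref_xml: str) -> bool:
    """An element reference is obligatory unless minOccurs is 0."""
    return stored_occurrence(element_ref_xml)["minOccurs"] != "0"


if __name__ == "__main__":
    plain_ref = '<xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema" ref="eve:name"/>'
    print(stored_occurrence(plain_ref))
    print(is_obligatory(plain_ref))
```

The values are kept as strings because "maxOccurs" may also be "unbounded".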

The stored names of element declarations are NCNAMEs, but by knowing the target namespace of the corresponding schema (i.e. "eve") the QNAME can be derived. The name of a type definition is also the NCNAME, but if e.g. a built-in type is specified then the name is the QNAME of the XML Schema specification ("xs:string", "xs:date").
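Deriving a QNAME from a stored NCNAME and the prefix of the target namespace amounts to a simple string operation. The helpers `is_ncname`, `qname` and `split_qname` are hypothetical, and the NCNAME check is a simplification restricted to ASCII characters; the actual NCName production of the XML Namespaces recommendation allows more characters.

```python
import re

# Simplified NCNAME pattern (ASCII only): no colon, no leading digit.
_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ncname(name: str) -> bool:
    """Check the simplified NCNAME rule used in this sketch."""
    return _NCNAME.fullmatch(name) is not None


def qname(prefix: str, ncname: str) -> str:
    """Build the colon-separated QNAME from a prefix and a stored NCNAME."""
    if not is_ncname(prefix) or not is_ncname(ncname):
        raise ValueError("prefix and name must not contain a colon")
    return f"{prefix}:{ncname}"


def split_qname(name: str) -> tuple[str, str]:
    """Split a QNAME such as 'xs:date' into prefix and NCNAME."""
    prefix, separator, local = name.partition(":")
    if not separator:
        raise ValueError(f"{name!r} has no prefix")
    return prefix, local


if __name__ == "__main__":
    print(qname("eve", "orttype"))
    print(split_qname("xs:date"))
```

For the example schema this yields "eve:orttype" for the stored name "orttype", while a built-in type such as "xs:date" is already stored in its qualified form.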

5. PRACTICAL USE OF EMX

The co-evolution of XML documents was already mentioned in section 1. At the University of Rostock a research prototype for dealing with this co-evolution was developed: CodeX (Conceptual design and evolution for XML Schema) [5]. The idea behind it is simple and straightforward: take an XML Schema, transform it into the specifically developed conceptual model (EMX - Entity Model for XML-Schema), change the simplified conceptual model instead of dealing with the whole complexity of XML Schema, collect this change information (i.e. the user interaction with EMX) and use it to automatically create transformation steps for adapting the XML documents (by using XSLT - Extensible Stylesheet Language Transformations [1]). The mapping between EMX and XSD is unique, so it is possible to describe modifications not only on the EMX but also on the XSD. The transformation and logging language ELaX (Evolution Language for XML-Schema [8]) is used to unify the internally collected information as well as to introduce an interface for dealing directly with XML Schema. Figure 9 illustrates the component model of CodeX, first published in [7] but now extended with the ELaX interface.

[Figure 9: Component model of CodeX. Labeled parts: Schema modifications, ELaX, Data supply, XSLT, XSD, Config, Model mapping, Model data, Knowledge base, Specification of operation, Evolution specific data, Configuration, XML documents, Update notes & evolution results, Transformation.]

The component model illustrates the different parts for dealing with the co-evolution. The main parts are an import and export component for collecting and providing data of e.g. a user (XML Schemas, configuration files, XML document collections, XSLT files), a knowledge base for storing information (model data, evolution-specific data and co-evolution results) and especially the logged ELaX statements ("Log"). The mapping information between XSD and EMX of table 2 is specified in the "Model data" component.

Furthermore the CodeX prototype also provides a graphical user interface ("GUI"), a visualization component for the conceptual model and an evolution engine, in which the transformations are derived. The visualization component realizes the visualization of an EMX introduced in table 2. The ELaX interface for modifying imported XML Schemas communicates directly with the evolution engine.
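The basic idea of this section, logging the changes made on the model and deriving document adaptions from the log, can be illustrated with a deliberately small sketch. It does not reproduce ELaX syntax or the XSLT generation of CodeX: the operation record `RenameElement`, the function `apply_log`, the sample document and its values are made up for illustration, namespaces are ignored, and the documents are rewritten directly with ElementTree instead of XSLT.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameElement:
    """Hypothetical log entry: an element declaration was renamed."""
    old_name: str
    new_name: str


def apply_log(document_xml: str, log: list[RenameElement]) -> str:
    """Adapt one XML document by replaying the logged operations in order."""
    root = ET.fromstring(document_xml)
    for operation in log:
        # Collect the matches first, then rename them.
        for node in list(root.iter(operation.old_name)):
            node.tag = operation.new_name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample document with made-up values, without namespaces.
    document = "<event><ort><name>Rostock</name><datum>2013-05-28</datum></ort></event>"
    log = [RenameElement("datum", "date")]
    print(apply_log(document, log))
```

Replaying the same log on every document of a collection keeps the documents in step with the changed schema, which is the purpose of the co-evolution described above.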

6. CONCLUSION

Valid XML documents need e.g. an XML Schema, which restricts the possibilities and usage of declarations, definitions and structures in general. In a heterogeneous, changing environment (e.g. an information exchange scenario), even "old" and long-used XML Schemas have to be modified to meet new requirements and to stay up to date.

EMX (Entity Model for XML-Schema) as a conceptual model is a simplified representation of an XSD, which hides its complexity and offers a graphical presentation. A unique mapping exists between every XSD modeled in the Garden of Eden style and an EMX, so it is possible to adapt or modify the conceptual model in place of the XML Schema.

This article presents the formal definition of an EMX: all in all there are different nodes, which are connected by directed edges. The abstract data model and the element information item of the XML Schema specification were considered, and the allowed edges are specified according to the specification. In general the most important components of an XSD are represented in an EMX, e.g. elements, attributes, simple types, complex types, annotations, constraints, model groups and group definitions. Furthermore the logical structure is presented, which defines not only the underlying storing relations but also the relationships between them. The visualization of an EMX is also defined: starting from 18 relations in the logical structure, there are eight EMX nodes in the conceptual model, of which six are visualized.

Our conceptual model is an essential prerequisite for the prototype CodeX (Conceptual design and evolution for XML Schema) as well as for the above-mentioned co-evolution. A remaining step is the finalization of the implementation in CodeX. After this work an evaluation of the usability of the conceptual model is planned. Nevertheless we are confident that the usage is straightforward and that EMX is a considerable simplification compared to dealing with the whole complexity of an XML Schema itself.

[1] XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 26-March-2013.

[2] Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/2008/REC-xml-20081126/, November 2008. Online; accessed 26-March-2013.

[3] XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition). http://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/, December 2010. Online; accessed 26-March-2013.

[4] W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/, April 2012. Online; accessed 26-March-2013.

[5] M. Klettke. Conceptual XML Schema Evolution - the CoDEX Approach for Design and Redesign. In BTW Workshops, pages 53-63, 2007.

[6] Maler. Schema design rules for UBL...and maybe for you. In XML 2002 Proceedings by deepX, 2002.

[7] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. In Grundlagen von Datenbanken, pages 29-34, 2012.

[8] T. Nösinger, M. Klettke, and A. Heuer. Automatisierte Modelladaptionen durch Evolution - (R)ELaX in the Garden of Eden. Technical Report CS-01-13, Institut für Informatik, Universität Rostock, Rostock, Germany, Jan. 2013. Published as technical report CS-01-13 under ISSN 0944-5900.

[9] E. van der Vlist. XML Schema. O'Reilly & Associates, Inc., 2002.