Determining XSLT Streamability Using New Hierarchical XSD Model
Jana Dvořáková and Filip Zavoral
Department Of Software Engineering
Faculty of Mathematics and Physics
Charles University in Prague, Czech republic
{Jana.Dvorakova, Filip.Zavoral}@mff.cuni.cz
Abstract. We introduce a new compact model of XML Schema framework for streaming XML transformations introduced
called a schema tree. Given an XSLT transformation xsl and an in [3]. The framework is capable to process automatically a
XML schema xsd, we present a method which statically analyzes class of top-down XSLT transformations which captures a
the schema tree constructed according to xsd and determines significant number of practically needed XML transforma-
whether xsl can be processed in a streaming manner on a set tions. The processing is done using a stack of the size pro-
of XML documents defined by xsd. We consider streaming pro-
portional to the depth of the input XML document - such
cessing that uses a stack of the size proportional to the depth of
the input document - this processing is highly efficient in practice
processing is highly efficient in practice since real XML
since real-world XML documents are shallow. The schema anal- documents are shallow [9].
ysis is performed via stepwise application of templates of xsl on We focus especially on the schema-based analyzer
the schema tree. We present the implementation of a schema tree which represents a powerful tool used within the Xord frame-
and the static XSLT analyzer on .NET platform. work to determine the most efficient way of processing the
given XSLT transformation. For a given XSLT stylesheet
xsl and an XML schema xsd1 , it automatically analyzes
1 Introduction the memory usage of the streaming processing of xsl on a
set of documents defined by xsd.
Many applications need to employ streaming approach The existing models for XML schemas (DOM, .NET
when processing huge data in XML format. Most typi- XmlSchema) appeared inconvenient for the purpose of the
cally, the languages XSLT [10] and XQuery [13] are used streamability analysis, we therefore introduce the Xord
to specify XML transformations. Both of them enable the Schema Model - a new compact model for schema repre-
user to write a high-level specification based on tree ma- sentation. The model is abstract, and thus not bounded to a
nipulation. Common processors of these languages (e.g., particular schema language. However, in the prototype im-
Saxon, Xalan, AltovaXML) are tree-based, i.e., read the plementation we employ W3C XSD notation [11, 12] for
whole input document into memory and then perform the XML schemas.
transformation itself.
Related work. Several streaming processors for XSLT and
The XSLT and XQuery tree-based processors are ap- XQuery have been implemented. However, their efficiency
parently not suitable when transforming XML streams or was demonstrated only by experiments on a small number
huge XML data. In this case, the transformation can be ei- of XML transformations and input XML documents. It is
ther written by hand using an event-base parser (e.g., SAX, thus not known how much memory is consumed on clearly
StAX) or using some streaming transformation language characterized transformation classes.
(STX [1], XStream [5]). In both cases, writing the speci- XML Streaming Machine (XSM) [8] processes a subset
fication is a non-trivial task since the user must explicitly of XQuery on XML streams without attributes and recur-
handle storing parts of the input document in the memory sive structures. It is based on a model called XML stream-
buffers for later processing. ing transducer. The processor have been tested on XML
In this paper we focus on the problem how to enable the documents of various sizes against a simple query. Using
user to write a tree manipulation specification in the XSLT XSM the processing time grows linearly with the document
language, and at the same time to process it in a streaming size, while in the case of standard XQuery processors the
manner automatically. Such automatic streaming processor time grows superlinearly. More complex queries have not
is supposed to apply the tree-manipulation functions over been tested.
a continuous stream of data while the buffering is treated BEA/XQRL [4] is a streaming processor that implements
automatically. An important issue is to design the processor full XQuery. The processor was compared with Xalan-J
in such a way that the size of memory buffers is minimized XSLT processor on the set of 25 transformations and an-
for the given transformation and the input document. other test was carried on XMark Benchmarks. BEA pro-
We describe the implementation of the Xord frame-
work which represents a prototype XSLT automatic stream- 1
We use the term XML schema for a general schema for XML
ing processor. The Xord framework is based on the formal documents.
adopted to model any XML schema language based on
structure definition.
XfAnalyzer XfAlgorithm
Furthermore, the framework is complemented by a set
of auxiliary helper classes. The algorithmic part of the API
XfXsdSsxtAnalyzer XfTemplateAnalyzer XfSsxt supports:
sch xslt
SsxtAlgorithm algorithm derived from the abstract Al-
xslt xslt
gorithm model, and
XfSchema XfXslt
(SchemaModel) (TemplateModel) XsdSsxtAnalyzer algorithm derived from the abstract An-
alyzer model, and using the Schema Model and the
Template Model.
SchemaModel
Fig. 1. TheTemplateModel
Xord framework. Ssxt
The implementation of the above mentioned models are de-
scribed in more detail in following sections.
cessor was fast on small input documents, however, the
processing of large documents was slower since the opti- 3 XSLT representation
mizations specially designed for XML streams are limited
The Xord framework is currently restricted to process sim-
in this engine.
ple XSLT transformations on XML documents without data
FluXQuery [7] is a streaming XQuery processor based values.
on a new internal query language FluX which extends
XQuery with constructs for streaming processing. XQuery Simple XSLT stylesheets. Simple XSLT stylesheet con-
query is converted into FluX and the memory size is opti- sists of an initializing template and several transforming
mized by examining the query as well as the input DTD. templates. The initializing template sets the current mode
FluXQuery supports a subset of XQuery. The engine was to the initial mode m0 and calls processing of the root ele-
benchmarked against XQuery processors Galax and AnonX ment of the input document. It is of the form:
on selected queries of the XMark benchmark. The results
show that FluXQuery consumes less memory and runtime.
SPM (Streaming Processing Model) [6] is a one-pass
streaming XSLT processor without an additional memory. The transforming templates are of the form:
Authors present a procedure that tries to converts a given ... template body ...
XSLT stylesheet into SPM. No algorithm for testing the
streamability of XSLT is introduced, and therefore the class
of XSLT transformations captured by SPM is not clearly The template body contains output elements (possibly nested)
characterized. and apply-templates calls. Output elements are of the form:
. . . element content . . .
The apply-templates construct has a select attribute
2 Xord Framework
1/1
that contains selecting expression, and a mode attribute.
The Xord framework for analyzing and transforming XML
data is implemented on .NET platform. Its application in- A subset of XPath expression is allowed in templates -
terface is formed by a set of interface classes for traversing they contain child and descendant axis, and select nodes
analyzed data structures. The core of the framework con- by name:
sists of these abstract models (see Fig. 1): XPath := Step | Step/XPath
Step := child::name | descendant::name
1. Template Model for transforming templates implemen-
where name refers to an element name.
ted by the XfXslt classes,
2. Schema Model for XML schemas implemented by the Xord Template Model. In the Xord framework, XSLT sty-
XfSchema classes, lesheets are represented by a set of classes, an Xord Tem-
3. Algorithm Model for streaming algorithms implemen- plate Model. Its simplified object structure is depicted in
ted by the XfSsxt classes, Fig. 2.
4. Analyzer Model for static analyzers implemented by the Each template from the XSLT contains a sequence of
XfXsdSsxtAnalyzer and XfTemplateAnalyzer classes. template calls. A template call consists of the parsed XPath
expression and the template called by the apply-templates
Since the models are abstract, the Template Model may be mechanism. The input template file is parsed into these
adopted to model templates of any template-based XML structures before the analysis. Then the analysis algorithm
transformation language and the Schema Model may be directly traverses the DAG, evaluates the expressions etc.
XfCall XfSchema
Step + template : XfTemplate + map : Dictionary
+ node : XmlNode
+ Name : string + select : XfXpath
+ Descendant : bool + mode : string
map
template
Item node Node
exp calls - node : Node - id : string
select
- minOccurs : int
XfTemplate
- maxOccurs : int
XfXpath + match : string
+ exp : List + mode : string
+ node : XmlNode children
+ calls : List
ComplexElement SimpleElement
calls templates - children : List-
XfXslt
- templates : List
- calls : List
XfRegexp atom Atom
+ regexp : List + id : string
+ type :
Fig. 2. The Xord Template Model.
Fig. 3. The Xord Schema Model.
4 Hierarchical XML schema representation
We represent an XML schema hierarchically as a schema programmatically, but its application interface is not very
tree. The representation does not depend on a particular useful for parsing and analyzing existing schemas.
schema notation (DTD, XSD). The schema tree consists of Since the schema analysis using standard XML schema
two kinds of nodes: DOM model would be very complicated and tangled, we
have designed an Xord Schema Model which is targeted to
– element nodes: correspond to an element type defined effective representation and analysis of existing schemas.
within schema A simplified object structure of that model is depicted in
– constructor nodes: correspond to constructors used in Fig. 3.
the schema (sequence, choice, *, +, ?) The whole schema is represented as an associative ar-
ray of simple or complex type nodes. 1Each
/1 complex node
The relationships among element types and constructors contains a list of references to its child nodes with their
are represented by the structure of the tree. cardinality. Using this recursive structure that form a DAG
Some subtrees of schema tree may be identical - this (or a tree with one particular node selected as a root), the
situation occurs if we derive the1schema
/1
tree from DTD or parsed schema could be easily traversed and processed.
XSD containing shared element types. When designing the
analyzer, the tree representation is more convenient. How-
ever in the implementation of schema-based analyzer each 5 Schema-based analyzer
type is represented as a single node and the whole schema
is represented as a DAG (see Schema Object Model be- The schema-based analyzer applies the given XSLT style-
low). sheet xsl to the schema tree xsd, starting at the root node.
In the schema-based analysis, we consider XML sche- First, let us remind the principles of the XSLT application
mas without the choice constructor and recursive defini- to the XML document tree. Let tmp be the current tem-
tions. Such schema can be represented as a single regular plate of the XSLT stylesheet (at the beginning of the trans-
expression. This representation is useful in the extraction formation, it is the template matching the root element in
part of the analyzer algorithm (see Section 5). the initial mode m0 )
Xord Schema Model. Although there are well established 1. The node sequence selected by the XPath expressions
and widely used XML parsers, we have found no suitable in the rule calls of the current template are found.
parser for XSD. To perform schema manipulation, the .NET 2. The templates called by the rule calls are applied to the
Framework provides a set of classes called the Schema Ob- selected nodes.
ject Model, or SOM for short. The SOM is for schemas However, in case of the schema tree, a modification of the
what DOM is for XML documents: the SOM classes repre- first step of this simple algorithm is needed:
sent various parts of a schema, for example XmlSchemaSim-
pleType, XmlSchemaElement, there are many other classes 1. All possible node sequences selected by the XPath ex-
that represent attributes, facets, groups, complex types, and pressions in the rule calls of the current template are
so on. This model is especially useful for creating schemas found.
bool AnalyzeNode(XfTemplate t,XfSchema.Node n) {
2. The templates called by the rule calls are applied to the if(t.Empty)
selected nodes. return true;
XfLastNames li = t.GetLastNames();
XfRegexp re = sch.ExtractFragment(n, t);
Since the set of all possible node sequences selected by if(re.Empty())
XPath expressions in the first step may be infinite, we rep- return true;
if(! sch.Compare(re, li))
resent it in the form of regular expression regexp. Such return false;
regular expression is basically a fragment of the schema foreach(XfCall call in t.calls) {
List ln =
tree, i.e., a set of its nodes (not necessarily connected) which new List();
is a fragment of the regular expression representing the ln = sch.EvalExp(n, call.select);
foreach(XfSchema.Node ni in ln) {
whole schema tree. AnalyzeNode(call.template, ni);
The regular expression regexp may contain both ele- }
}
ment nodes and constructor nodes. It is extracted as fol- return true;
lows: First, the node sequence selected by the XPath ex- }
pressions are found in the same way as in the XML docu- Fig. 4. The code of the AnalyzeNode function.
ment tree (constructor nodes are skipped). Second, all con-
structors appearing in the branch of the schema tree from
the root to the selected nodes are added to the sequence. 6 Stack-based streaming algorithm
The hierarchy of the nodes is preserved by delimiting the
nodes appearing at the same level of the schema tree by The stack-based algorithm is based on a formal model called
parentheses. simple streaming XML transducer (SSXT), therefore we call
We say that regexp represents possible reading orders it the SSXT algorithm. The transducer has a single input
of the element names selected by the expressions in tmp, head that reads the input document sequentially, and a sin-
i.e., the order in which the elements are accessed when a gle output head that generates the output document sequen-
document defined by the schema xsd is read sequentially. tially. The SSXT is equipped with a stack to store tempo-
Now let names be a sequence of element names in the or- rary data.
der they are called in tmp - clearly, the sequence can be The SSXT takes an input document din and a top-down
constructed statically by examining the last steps of the XSLT stylesheet xsl as the input. It reads din sequentially
XPath expressions in tmp. The names sequence repre- in one pass and apply the stylesheet xsl stepwise. First,
sents the processing order of the elements. In case one of the template matching the root element of din in the initial
the reading orders does not conform to the processing or- mode m0 is set to be the currently processed template (cur-
der, the order-preservation of the xsl is violated and the rent template). The processing proceeds in cycles. During
SSXT algorithm is not applicable2 . It is thus only neces- a single cycle, a single template call of the current template
sary to compare regexp to the names sequence in order to is processed.
check applicability ot the stack-based algorithm. Processing cycle. All XPath expression within a template
are evaluating concurrently. The evaluation is realized by
Implementation. The core of the schema-based analyzer deterministic finite automata (DFA)3 . A single DFA is con-
is the AnalyzeNode function which takes two arguments structed for each expression. When the processing of a tem-
- a template of xsl and a node of the schema tree xsd. plate starts, the sequence of the initial states of DFAs is
It performs the application of the template to the schema pushed on the stack. The input head of SSXT reads the
tree node as described above. Using the Template Model elements of din in document order. When a start-tag is en-
and the Schema Models allows the analyzer algorithm to countered, new sequence of DFAs is computed. Three sit-
be simple and straightforward - see Fig. 4. uations may occur:
The comparison of the regexp to the names sequence
is accomplished by the Compare function. Its implemen- a) new sequence contains no final state - the input head
tation is based on inherent properties of its arguments. In- continues in evaluation,
stead of an expensive checking of swapping for each pair b) new sequence contains a single final state which be-
of names, the predicate is a compound of two simple steps. longs to the DFA evaluating the lastly-matched expres-
First, regexp is checked for existence of two distinct names sion or an expression located after the lastly-matched
within any ’+’ or ’*’ sequence. Second, the last names in expression - the corresponding template call is pro-
names are stripped to those contained in the schema be- cessed,
c) new sequence contains a final state which belongs to
ing used, adjacent duplicities are reduced to a single name,
the DFA evaluating expression located before the lastly-
and the resulting list is linearly compared to names con-
matched expression, or it contains two or more final
tained in regexp. Since each name appearing regexp must
states - error.
be contained in names, any difference cause a fail.
3
We refer the reader to [2] for a more detailed description of
2
See [3] for further details. this evaluating method.
void ProcessDfaSequence(XfXml xml)
In case b), the current cycle configuration (template id, {
matched expression id) is pushed on the stack and new SIDfaSequence ds = stk.GetDfaSequence();
switch(xml.currType) {
cycle for processing the called template starts. The cycle case XmlNodeType.Element:
configuration is popped after the whole called template has SIDfaSequence new ds = ds.Transition(xml.currName);
if(!new ds.HasFinalStates()) {
been processed and the control moves back to the current stk.Push(new ds);
template. In case a), the evaluation continues. Here if an xml.Advance();
}
end-tag is encountered, the sequence of the DFA states lo- else {
cated at the top of the stack is popped. Hence, the XPath ex- XfCall myCall = new ds.GetCallWithFinalState();
currTemplate.Generate(currCall, myCall);
pression of the current template are evaluated on “branches” XfTemplate calledTemplate =
of din . xslt.SelectTemplate(xml.currName, myCall.mode);
if(calledTemplate.Empty) {
calledTemplate.Generate(null, null);
currCall = myCall;
Implementation. The implementation of the SSXT algo- if(xml.laType == XmlNodeType.Element)
rithm uses both Template Model and Algorithm Model classes. stk.Push(new ds);
xml.Advance();
Since the algorithm is stack-based, the main data struc- } else {
ture used is a polymorphic stack stk of sequences of DFA stk.Push(new SICycleConfig(currTemplate,
myCall));
states (SIDfaSequence) and cycle configurations (SICycle- currTemplate = calledTemplate;
Config), see Fig. 8. currCall = null;
}
Until the transformation is finished the top of stack }
break;
is checked and the stack item is processed, see the func- case XmlNodeType.EndElement:
tion RunSsxt in Fig. 5. In case of an empty stack and if(xml.laType == XmlNodeType.EndElement)
stk.Pop();
nonempty remaining input new DFA sequence is pushed xml.Advance();
on the stack. break;
default:
stk.Pop();
break;
void RunSsxt(XfXml xml) }
{ }
XfTemplate currTemplate = xslt.Start();
XfCall currCall = null; Fig. 6. The code of the ProcessDfaSequence function used in the SSXT algorithm.
bool transformed = false;
while(!transformed) {
if(!stk.Empty()) {
switch(stk.Type()) {
case XfStack.ItemType.DfaSequence: The cycle configuration processing (Fig. 7) depends on
ProcessDfaSequence(); the current XML node type. A start tag pushes a new DFA
break;
case XfStack.ItemType.CycleConfig: sequence while an end tag generates output.
ProcessCycleConfig();
break;
}
void ProcessCycleConfig(XfXml xml)
} else {
{
switch(xml.currType) {
SICycleConfig cc = stk.GetCycleConfig();
case XmlNodeType.Element:
switch(xml.currType) {
stk.Push(new SIDfaSequence(currTemplate));
case XmlNodeType.Element:
xml.Advance();
if(xml.laType == XmlNodeType.Element)
break;
stk.Push(new SIDfaSequence(currTemplate));
case XmlNodeType.EndElement:
xml.Advance();
currTemplate.Generate(currCall, null);
break;
transformed = true;
case XmlNodeType.EndElement:
break;
currTemplate.Generate(currCall, null);
}
currTemplate = cc.template;
}
currCall = cc.call;
}
stk.Pop();
}
break;
}
Fig. 5. The code of the SSXT algorithm. }
Fig. 7. The code of the ProcessCycleConf function used in the SSXT algorithm.
The core of the DFA sequence processing (Fig. 6) is
accomplished when start tags of elements are encountered.
A new DFA sequence is generated on the stack in case the
current DFA sequence contains no final states. Otherwise 7 Evaluation
the output is generated and a new cycle configuration is
placed on the stack. In case of a template without calls, its The evaluation and measurements of the SSXT algorithm
output is generated immediately. implementation confirmed our expectation that it requires
ing algorithm with such buffers and we obtain much more
XfCall
(TemplateModel)
XfTemplate
(TemplateModel)
powerful automatic streaming XSLT processor.
DfaItem
+ template : XfTemplate + match : string
+ node : XmlNode call
+ call : XfCall + mode : string Acknowledgments. This work was supported in part by
+ select : XfXpath + node : XmlNode
+ mode : string + calls : List the National programme of research (Information society
template
project 1ET100300419).
call
items
SICycleConfig SIDfaSequence
+ template : XfTemplate + template : XfTemplate References
+ call : XfCall + items : List
1. O. Becker. Transforming XML on the Fly. In Proceedings
cc
ds of XML Europe 2003, 2003.
StackItem XfStack 2. Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer.
- cc : SICycleConfig stack - st : Stack Path sharing and predicate evaluation for high-performance
- ds : SIDfaSequence
XML filtering. ACM Trans. Database Syst., 28(4):467–516,
stk
2003.
XfAlgorithm
3. J. Dvořáková. Towards Analyzing Space Complexity of
# xslt : XfXslt Streaming XML Transformations. In The Second IEEE Inter-
# stk : XfStack
national Conference on Research Challenges in Information
Science. IEEE Computer Society, 2008.
Fig. 8. The Xord SSXT Model. 4. D. Florescu, C. Hillery, D. Kossmann, P. Lucas, F. Riccardi,
T. Westmann, M. J. Carey, A. Sundararajan, and G. Agrawal.
The BEA/XQRL Streaming XQuery Processor. In Proceed-
ings of VLDB 2003, pages 997–1008, 2003.
a memory proportional to a depth of the input XML doc- 5. A. Frisch and K. Nakano. Streaming XML Transformations
Using Term Rewriting. In Proceedings of PLAN-X 2007,
ument. Since most documents are relatively shallow, our
2007.
memory requirements are independent to the document size.
6. Z. Guo, M. Li, X. Wang, and A. Zhou. Scalable XSLT Eval-
Even for huge documents like DBLP, the SSXT algorithm uation. In Advanced Web Technologies and Applications,
required few hundreds KB while the commonly used XSLT LNCS 3007/2004. Springer Berlin / Heidelberg, 2004.
processors like Saxon or Xalan crashed or hanged after al- 7. C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier.
locating about 1.5 GB of memory. FluXQuery: An optimizing XQuery processor for streaming
XML data. In VLDB’2004: Proceedings of the Thirtieth
International Conference on Very Large Databases, pages
8 Conclusion and future 1/1
work 1309–1312, 2004.
8. B. Ludäscher, P. Mukhopadhyay, and Y. Papakonstantinou.
A Transducer-Based XML Query Processor. In Proceedings
We have presented a prototype implementation of the Xord
of VLDB 2002, pages 227–238, 2002.
framework which represents an automatic streaming pro- 9. I. Mlýnková, K. Toman, and J. Pokorný. Statistical Analysis
cessor for the XSLT language. It incorporates a powerful of Real XML Data Collections. In COMAD’06: Proc. of the
schema-based analyzer which, for a given XSLT transfor- 13th Int. Conf. on Management of Data, pages 20–31, New
mation xsl and an XML schema xsd, analyzes memory Delhi, India, 2006. Tata McGraw-Hill Publishing Company
requirements of the streaming processing of xsl on a set Limited.
of XML documents defined by xsd. The analyzer employs 10. W3C. XSL Transformations (XSLT) Version 1.0, W3C Rec-
a special hierarchical model of XML schema called Xord ommendation, 1999. http://www.w3.org/TR/xslt.
Schema Model. We have implemented the Xord framework 11. W3C. XML Schema Part 1: Structures Sec-
on .NET platform for a specific streaming processing using ond Edition, W3C Recommendation, 2004.
stack of the size proportional to the depth of the input XML http://www.w3.org/TR/xmlschema-1.
12. W3C. XML Schema Part 2: Datatypes Sec-
document.
ond Edition, W3C Recommendation, 2004.
Our schema-based analyzer is restricted in several as- http://www.w3.org/TR/xmlschema-2.
pects - first, a subset of XSLT and XML schema defini- 13. W3C. XQuery 1.0: An XML Query Language, W3C Recom-
tions is considered, and second, it currently gives us only mendation, 2007. http://www.w3.org/TR/xquery.
true/false answer whether the stack-based processing is ap-
plicable. However, we intend to extend it in the future re-
search - if we examine particular pairs of elements for which
the comparing function returns false and the possible size
of their content, we may compute exact size of the mem-
ory buffers needed for processing such elements. Then it
is only necessary to extend the basic stack-based stream-