=Paper=
{{Paper
|id=None
|storemode=property
|title=XML Document Correction and XQuery Analysis with Analyzer
|pdfUrl=https://ceur-ws.org/Vol-706/paper15.pdf
|volume=Vol-706
|dblpUrl=https://dblp.org/rec/conf/dateso/StarkaSSMB11
}}
==XML Document Correction and XQuery Analysis with Analyzer==
Jakub Stárka, Martin Svoboda, Jiří Schejbal, Irena Mlýnková, and David Bednárek

Department of Software Engineering, Faculty of Mathematics and Physics
Charles University in Prague
Malostranské náměstí 25, 118 00 Praha 1, Czech Republic
{starka, svoboda, mlynkova, bednarek}@ksi.mff.cuni.cz
Abstract. This paper describes extensions of our previously proposed
SW prototype – Analyzer, a framework for performing statistical anal-
yses of real-world XML data. Firstly, it describes the design and imple-
mentation of a system for the analysis of a collection of XQuery programs.
It is based on the frequency of occurrence of various language con-
structs and their combinations defined by the user. At the core of the sys-
tem, each XQuery program is converted to a suitable XML representation
which allows analytical queries to be formulated in the XPath language.
Secondly, we introduce the model involving repairs of elements and at-
tributes with respect to single-type tree grammars. Via the inspection
of the state space of an automaton recognising regular expressions, we
are always able to find all minimal repairs represented by recursively
nested multigraphs, which can be translated to particular sequences of
edit operations altering data trees. We have proposed four particular al-
gorithms and provided the prototype implementation supplemented with
experimental results.
1 Introduction
The eXtensible Markup Language (XML) [5] is currently the de facto standard for
data representation. Its popularity stems from the fact that it is well-defined,
easy to use and, at the same time, powerful enough. The problem is that the
XML standards were proposed in their full generality so that future users
can choose what suits them most. Nevertheless, real-world XML data are
usually not so “rich”, so much of the effort spent on supporting every possible
feature is wasted.
Exploitation of results of statistical analyses of real-world data is a classical
optimization strategy in various areas of data processing. It is based on the idea
to focus primarily on efficient implementation of constructs that are used in
real-world data most often. One of the most important advantages of statistical
analyses of real-world data is refutation of incorrect assumptions on typical use
cases, features of the data, their complexity etc. As an example we can consider
exploitation of recursion. The support for recursion is often neglected and it is
considered a side or auxiliary construct. However, analyses [11] show that in
selected types of XML data it is used quite often and, hence, its efficient, or at
least basic, support is very important. On the other hand, the number of distinct
recursive elements is typically low (less than 5 for each category) and the
type of recursion commonly used is very simple.
However, working with real-world data is not simple, because the data often
change, are imprecise, or even contain a number of errors. In this case we can
either discard the incorrect data, and, hence, lose a significant portion of them,
or provide a kind of corrector.
In the next step we want to perform the analyses themselves. Currently there
exist a number of papers which focus on statistical analyses of real-world XML
data [11, 3, 10]; however, an analysis of real-world XML operations, in particu-
lar queries, is still missing. The reason is mainly the complexity of crawling a
representative set and the complexity of the analytical process.
In this paper we describe two parts of a general framework for statistical
analyses of real-world XML data called Analyzer. Firstly, we focus on a correction
framework involving structural repairs of elements with respect to a single-type
tree grammar. Secondly, we describe the usage of the framework for XQuery
analysis. Since there are no standardized real-world datasets, we use two artificial
collections to demonstrate the approach.
Outline The paper is structured as follows: In Section 2 we describe the related
work, in particular concerning corrections of data. In Section 3, we describe the
architecture of Analyzer which indicates its general functionality. Section 4 is
devoted to the processing of incorrect data. In Section 5, we show the principles of
the query analysis used and present results of a query analysis of some artificial
data. Finally, in Section 6 we conclude.
Relation to Previous Work In this paper, we extend our previous work [17].
Motivated by a successful and interesting statistical analysis of real-world XML
data [11], Analyzer was implemented as a software project of Master's students
at the Department of Software Engineering of Charles University in Prague.
Its installation package as well as documentation and source files can be found
at its official web site [14]. Its first release 1.0 involved only basic functionality
to demonstrate its key features and advantages and it was briefly introduced
in paper [17]. In the following text, we describe extensions of Analyzer focused
on XML document correction, which was first proposed in [15] and extended
in [16], and a new module for XQuery analysis.
2 Related Work
As we have mentioned in the introduction, while currently there exists a number
of papers focussing on statistical analysis of XML documents, XML schemas
or their mutual comparison [11, 3, 10], there is no paper that would describe
either the results or the process of analysis of real-world XML queries. Thus, in
this section we focus on the related work concerning the second aim of this paper –
correction of XML documents, in particular their re-validation.
The proposed correction model is based primarily on ideas from [2] and [13].
The authors of the former paper dynamically inspect the state space of a finite au-
tomaton recognising regular expressions in order to find valid sequences of
child nodes with minimal distance. However, this traversal is not efficient, re-
quires threshold pruning to cope with potentially infinite trees, repeatedly
computes the same repairs and performs well only in the context of incremen-
tal validation. Although these disadvantages are partially addressed in the latter
paper, its authors focused on querying documents rather than repairing them.
Next, we can mention an approximate validation and correction approach
[18] based on testers and correctors from the theory of program verification. Re-
pairs of data inconsistencies such as violations of functional dependencies, keys and
multivalued dependencies are the subject of [12, 20].
Contrary to all existing approaches, we consider single-type tree grammars
instead of only local tree grammars; thus, we can work with both DTD and XML
Schema. While the approaches in [2, 18] are not able to find repairs of heavily damaged
documents, we are always able to find all minimal repairs, and we do so without
any threshold pruning to handle potentially infinite XML trees. Next, we have
proposed a much more efficient algorithm that follows only promising directions of
the correction and avoids repeated repair computations. Finally, we have a
prototype implementation available at [7], and the performed experiments show a
time complexity linear in the number of nodes in the documents.
3 Framework Description
This section briefly covers the Analyzer architecture, the proposed analysis model
and basic implementation aspects. The details are described in paper [17].
Architecture. Analyzer allows the user to work with multiple open projects at
once, each representing one analytical research intent. Thus, we can divide the
framework architecture into two separate levels, as depicted in Figure 1.
The first one contains components, which are shared by all these projects. The
second one represents components exclusively used and created in each opened
project separately (repositories, storages, crawlers and entity managers).
Fig. 1. Analyzer Framework Architecture
Repositories serve for storing all computed analytical data and the majority
of project configuration metadata. Storages are used for storing document con-
tents, i.e. binary contents of analyzed files. Finally, documents to be analyzed
can be inserted into existing projects through import sessions (locally accessible
files) or download sessions (downloading files from the Internet via crawlers).
The project layer contains a set of managers, which are responsible for creat-
ing, editing and processing of all analysis entities such as documents, collections
of documents or reports over collections. As all computed analytical data are
stored permanently in a repository, in order to increase efficiency, these man-
agers are able to cache loaded data and some of them even to postpone and
aggregate required update operations without violating the consistency.
Analyses. Although the framework also enables more complex usages, the stan-
dard life cycle of each project can be represented by the following phases.
1. Creation of a new project and configuration of its components,
2. Selection and configuration of analyses using available plugins,
3. Insertion of documents to be analyzed through import or download sessions,
4. Computation of analytical results over documents of a given relative age,
5. Selection and configuration of collections and clusters of them,
6. Document classification and assignment into collections, and
7. Computation of final statistical reports over particular collections.
Plugins. Analyzer itself provides a general environment for performing analyses
over documents and collections of documents, but the actual analytical logic is
not a part of it. All analytical computations and mechanisms are implemented
in plugins. Not only is each particular plugin intended for processing
selected document types only, the user is also able to configure the available plugins
and, thus, adjust their behaviour to the desired analytical intents.
The plugin functionality itself is provided through implemented methods,
which are of eight predefined types listed in the following enumeration.
– The detector recognizes types of a processed document,
– The tracer looks for outgoing links in a given document,
– The corrector attempts to repair the content of a given document,
– The analyzer produces results over a given document,
– The collector classifies documents into collections of a given cluster,
– The provider creates reports over documents in a collection,
– The viewer serves for browsing computed results over a document, and
– The performer serves for browsing computed reports over a collection.
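To make the plugin contract more tangible, the following is a minimal sketch of how such an interface might look; the class name, method names and signatures are hypothetical and merely mirror the eight method types listed above, not the actual Analyzer plugin API.

```python
from abc import ABC, abstractmethod

class AnalyzerPlugin(ABC):
    """Hypothetical plugin contract mirroring the eight method types;
    the real Analyzer API may differ."""

    @abstractmethod
    def detect(self, document): ...        # detector: recognize the document type

    @abstractmethod
    def trace(self, document): ...         # tracer: find outgoing links

    def correct(self, document):           # corrector: optionally repair the content
        return document

    @abstractmethod
    def analyze(self, document): ...       # analyzer: produce results over a document

    def collect(self, document, cluster):  # collector: classify into a collection
        raise NotImplementedError

    def provide(self, collection):         # provider: create reports over a collection
        raise NotImplementedError

    def view(self, result): ...            # viewer: browse computed results

    def perform(self, report): ...         # performer: browse computed reports
```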
4 Processing of Incorrect Data
During the phase of document processing and result generation in the Analyzer
framework, the corrector methods of available plugins are able to modify the con-
tents of the processed documents. This feature is motivated primarily by the possibility
of working with potentially incorrect documents.
In this section, we particularly focus on the problem of structural invalidity of
XML documents. In other words, we assume the inspected documents are well-
formed and constitute trees, however, these trees do not conform to a schema in
DTD [5] or XML Schema [9]. Having a potentially invalid XML document, we
process it from its root node towards leaves and propose minimal corrections of
elements in order to achieve a valid document close to the original one.
Fig. 2. Sample invalid XML tree with three possible minimal repairs
In Figure 2 we can see a sample correction process. Item (a) represents an
original XML document, where element names are depicted by labels inside
nodes. Without any detailed schema knowledge, assume only that element x at
position 0 is not allowed. Processing this invalid tree, our algorithm finds three
different minimal repairs, all of which are outlined in Items (b), (c) and (d).
The remaining parts of this section will present basic ideas of our correction
model and proposed algorithms for finding structural repairs of invalid XML
documents. Details of this proposal are presented in [16, 15].
4.1 Proposed Solution
Our correction framework is capable of generating local structural repairs for in-
valid elements. These repairs are motivated by the classic Levenshtein metric for
strings. For each node in a given XML tree and its sequence of child nodes we at-
tempt to efficiently inspect new sequences that are allowed by the corresponding
content model and that can be derived using the extended concept of measuring
distances between strings. However, in our case we do not handle only ordinary
strings, but sequences derived from node labels with nested subtrees.
The correction algorithm starts processing at the root node and recursively
moves towards leaf nodes. We assume that we have the complete data tree loaded
into the system memory and, therefore, we have a direct access to all its parts.
Under all conditions the algorithm is able to find all minimal repairs, i.e. repairs
with the minimal distance to the grammar and the original data tree according
to the introduced cost function.
Edit Operations. Edit operations are elementary transformations that are used
for altering invalid data trees into valid ones. They behave as functions perform-
ing small local modifications on a provided data tree. Although the correction
algorithm does not directly generate sequences of these edit operations, we can,
in the end, acquire them using a translation of the generated repairs, as will be
explained later. We have proposed and implemented edit operations capable of
inserting a new leaf node, deleting an existing one and renaming the label of a node.
Edit operations can be composed into sequences, and if these se-
quences fulfil certain conditions, they can be classified as update operations. In
this way we obtain update operations capable of inserting a new subtree,
deleting an existing subtree and recursively repairing a subtree with an option of
changing the label of its root node.
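As a purely illustrative sketch, the three edit operations could be modelled on a simple in-memory data tree as follows; the Node class and the concrete repair at the end are hypothetical and are not taken from the Analyzer implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)

# The three elementary edit operations of the correction model, each
# performing a small local modification on a provided data tree.

def insert_leaf(parent: Node, position: int, label: str) -> None:
    """Insert a new leaf node among the children of `parent`."""
    parent.children.insert(position, Node(label))

def delete_leaf(parent: Node, position: int) -> None:
    """Delete an existing leaf node (it must not have children)."""
    assert not parent.children[position].children
    del parent.children[position]

def rename(node: Node, new_label: str) -> None:
    """Rename the label of a node."""
    node.label = new_label

# Example: one conceivable local repair of a tree with a disallowed
# element x at position 0 (cf. Figure 2) is to rename it.
root = Node("a", [Node("x"), Node("b")])
rename(root.children[0], "b")
```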
Repairing Instructions. Assume that we are in a particular node in a data
tree and our goal is to locally correct this node by correcting the sequence of its
child nodes. Since the introduced model for measuring distances uses only non-
negative values for the cost function, in order to acquire the global optimum, we
can simply find minimal combinations of local optima, i.e. minimal repairs
for all subtrees of the original child nodes of the inspected one.
However, we need to find all minimal repairs and even represent them in a
compact repair structure. For this purpose we use repairing instructions. We have
exactly one instruction for each edit operation, and these instructions represent
the same transformation ideas; however, they do not include the particular positions
at which they are to be applied. Having a sequence of instructions at a given level,
we can easily translate it into all corresponding sequences of edit operations later on.
Correction Intents. Being in a particular node and repairing its sequence of
child nodes, the correction algorithm generally has many ways to achieve
local validity by proposing repairs for all involved child nodes. As already outlined,
these actions follow the model of measuring distances between ordinary strings.
The Levenshtein metric is defined as the minimal number of required elementary
operations to transform one string into another.
We follow the same model; however, we work with edit and update operations,
respectively, and with sequences of nodes. For example, an insertion of a new subtree
at a given position stands for the insertion of its label into the corresponding
node sequence and, of course, the recursive generation of this new subtree.
The algorithm attempts to examine all suitable new words that are in the
language of the provided regular expression restricting the content model of the
inspected parent node. We do not generate the words one by one; instead, we inspect all
suitable words statically using the notion of a correction multigraph. Correction
intents represent assignments for these multigraphs, i.e. the recursive data tree
processing in a top-down manner.
Correction Multigraphs. All existing correction intents in the context of a
given parent node can be modelled using a multigraph for this node. Vertices of
a multigraph for n child nodes can be divided into n + 1 disjoint strata; the vertices
of each stratum correspond to the states of the Glushkov automaton recognising
the provided regular expression. Edges in the multigraph are derived from the
automaton transition function and they represent nested correction intents.
In order to find the best repairs for a provided sequence of nodes, we need to
find all shortest paths in the multigraph. Therefore, we first need all its edges to
be associated with already evaluated nested repair structures and their minimal
costs, which represents nontrivial nested recursive computations. In any case, we
require that each edge can be evaluated in a finite time, otherwise we would
obviously not be able to find required shortest paths at all.
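The following is a minimal sketch of the shortest-path search over such a stratified multigraph, under the assumption that the nested repairs of all edges have already been evaluated; the graph representation is hypothetical, and the sketch returns only the cost of one shortest path, whereas the actual algorithm collects all minimal paths together with their repair structures.

```python
import heapq

def shortest_repair_cost(strata_count, edges, initial_state, final_states):
    """Dijkstra over a correction multigraph whose vertices are
    (stratum, automaton state) pairs and whose edge weights are the
    minimal costs of already evaluated nested repairs.

    edges: dict mapping (stratum, state) -> list of ((stratum', state'), cost)
    """
    start = (0, initial_state)
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, vertex = heapq.heappop(queue)
        stratum, state = vertex
        # A repair is complete once all child positions have been consumed
        # and we end in an accepting state of the Glushkov automaton.
        if stratum == strata_count - 1 and state in final_states:
            return cost
        if cost > dist[vertex]:
            continue
        for target, weight in edges.get(vertex, []):
            new_cost = cost + weight
            if new_cost < dist.get(target, float("inf")):
                dist[target] = new_cost
                heapq.heappush(queue, (new_cost, target))
    return None  # no valid repair found for this content model
```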
Repairs Construction. Each correction intent can essentially be viewed as
an assignment for the nested recursive processing. The correction of a provided
data tree is initiated as a special starting correction intent for the root node,
and the processing of every intent always involves the construction of at least the
required part of the introduced multigraph with other nested intents.
Therefore, we continuously invoke recursive computations of nested intents.
When we reach the bottom of the recursion, we start backtracking, i.e. encap-
sulating all found shortest paths into a form of a compact repair structure and,
then, passing it one level up, towards the starting correction intent.
Having constructed a repair structure for the starting intent, we have found
corrections for the entire data tree. Each intent repair contains encoded shortest
paths and related repairing instructions. Now we need to generate all particular
sequences of repairing instructions and translate them into standard sequences
of edit operations. Having one such edit sequence, we can apply it to the original
data tree and obtain a valid correction with a minimal distance.
Correction Algorithms. Now, we have completely outlined the proposed cor-
rection model. However, there are several related efficiency problems that would
cause significantly slow behaviour if we strictly followed this model. There-
fore, we have introduced two particular correction algorithms, which are de-
scribed in detail in [16]. They both produce the same repairs, but there are
differences in their efficiency.
The first algorithm is able to directly search for shortest paths inside each
intent computation and, therefore, does not need the entire multigraphs to be
constructed. The next improvement is based on caching already computed re-
pairs using signatures that identify correction intents which differ but lead to
the same resulting repair structure. As a consequence, the algorithm never com-
putes the same repair twice. The second algorithm is able to evaluate lazily
even along the depth of the recursion. We have achieved this behaviour by split-
ting the invocation of all nested intents and the evaluation of multigraph edges
into small tasks, which are incrementally executed by a simple scheduler.
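A minimal sketch of the signature-based caching idea is shown below; it assumes that a correction intent can be summarised by a hashable signature (e.g. the content model together with fingerprints of the child subtrees to be repaired) and that the actual repair computation is supplied as a callable. Both assumptions are ours; the precise signature definition is given in [16].

```python
class RepairCache:
    """Cache of repair structures keyed by correction-intent signatures,
    so that the same repair is never computed twice (a sketch only)."""

    def __init__(self, compute_repair):
        self._compute = compute_repair  # callable: intent -> repair structure
        self._cache = {}

    def repair(self, intent, signature):
        # `signature` identifies intents that differ but lead to the same
        # resulting repair structure.
        if signature not in self._cache:
            self._cache[signature] = self._compute(intent)
        return self._cache[signature]
```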
5 Query Analysis
In this section, we describe the second unique feature of Analyzer – XQAnalyzer
– a tool designed to support studies that include analysis of a collection of
XQuery programs. XQAnalyzer consumes a set of XQuery programs, converts
them into a kind of intermediate code, and stores this internal representation
in a repository. Subsequently, various analytical queries may be placed on the
repository to determine the presence or frequency of various language constructs
in the collection, including complex queries focused on particular combinations
of constructs or classes of constructs.
Fig. 3. The architecture of the XQAnalyzer
The architecture of the XQAnalyzer is shown in Figure 3. Each document
from a given collection of XQuery programs is parsed and converted to the in-
ternal representation by the XQConverter component. The XQEvaluator com-
ponent evaluates analytical queries and returns statistical results.
5.1 Analytical Queries
In the XQAnalyzer, the term analytical query denotes a pattern or condition
placed on an XQuery program, usually a search for a feature. Each XQuery pro-
gram in the collection is evaluated independently, producing either a boolean
value or a hit count representing the presence or number of occurrences in the
program, respectively. The XQEvaluator then returns various statistical results
like the percentage of programs which contain the searched feature or the his-
togram of hit counts over the repository.
Given the fact that the tool is designed for research in the area of XML and,
in particular, XQuery, the best choice would be a query language derived from
XPath. XPath is naturally well-known in the community and it is designed to
place pattern-like queries on tree structures – in our case, a tree is a typical
representation of a program during early stages of its analysis.
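As an illustration, the following is a minimal sketch of how such an XPath analytical query could be evaluated over an XML serialization of the internal representation, using element names from Table 1; the concrete IR layout, the attribute names and the two toy documents are our assumptions, not the actual XQEvaluator interface.

```python
from lxml import etree

# Two toy IR documents (hypothetical layout; element names follow Table 1).
ir_documents = [
    "<QueryBody><FLWOR><ForClause><Axis name='descendant-or-self'/></ForClause></FLWOR></QueryBody>",
    "<QueryBody><Path><Axis name='child'/></Path></QueryBody>",
]
programs = [etree.fromstring(ir) for ir in ir_documents]

# Analytical query: does the program contain a for-clause that uses the
# descendant-or-self axis anywhere within it?  (Boolean per program.)
query = "boolean(//ForClause//Axis[@name='descendant-or-self'])"

hits = sum(1 for program in programs if program.xpath(query))
print(f"{hits} of {len(programs)} programs contain the feature")
```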
5.2 Internal Representation of XQuery Programs
The key issue in the design of XQAnalyzer is the internal representation of
XQuery programs. In our approach, we do not want to limit the nature of the
analytical queries; therefore, the internal representation must store any XQuery
program without loss of any feature (perhaps except for comments). Further-
more, the internal representation is exposed to the user via the query interface;
therefore, it should be as simple as possible. Finally, the internal representation
affects the performance of the XQEvaluator.
The W3C standards related to XQuery define at least the following two
formalisms that might be used as a base for our internal representation:
– The XQuery Grammar (in Extended Backus-Naur Form) defined in [4].
– The Normalized XQuery Core Grammar (also in EBNF) defined in [8].
Note that the XQuery formal semantics [8] is defined in terms of static/dynamic
evaluation rules that may be considered a kind of internal representation too.
However, their application in our analytical environment would be impractically
difficult.
Among the existing formalisms mentioned so far, we have chosen the Nor-
malized XQuery Core Grammar. There are the following reasons behind this
decision:
– It is a part of the standard, therefore well known and not skewed towards
any evaluation strategy.
– It is smaller than the full XQuery Grammar and it hides the redundant
features of the XQuery language.
The final set of nonterminals is listed in Tab. 1 together with their frequency
in selected collections of XQuery programs (see Sec. 5.3). When our internal
representation is presented in the form of an XML document, these nonterminals
become XML elements.
The rest of the semantic information is enclosed in XML attributes attached
to these elements. These attributes contain either data extracted from the source
text (like names of variables or contents of literals) or additional semantic in-
formation (like the axis used in an XPath axis step). In addition to these data
required to preserve the semantics, we also added attributes that may help recover
the original syntax used before the normalization to XQuery Core (e.g. whether
the abbreviated or the full syntax was used in an axis step).
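As a purely hypothetical illustration (the exact nesting and attribute names produced by XQConverter are not shown in this paper), a simple path expression such as doc("bib.xml")//book/title might be serialized along the following lines, using element names from Table 1:

```xml
<!-- Hypothetical sketch only: element names follow Table 1, but the real
     nesting and attribute names may differ. -->
<QueryBody>
  <Path>
    <FunctionCall name="fn:doc">
      <Literal type="xs:string" value="bib.xml"/>
    </FunctionCall>
    <Axis name="descendant-or-self" abbreviated="true">
      <NameTest name="book"/>
    </Axis>
    <Axis name="child" abbreviated="true">
      <NameTest name="title"/>
    </Axis>
  </Path>
</QueryBody>
```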
5.3 Results
Since there is no standardized collection of real-life XQuery programs yet (except
for small benchmarks like XMark [1]), we have chosen two artificial collections
associated with the W3C XQuery language specification: the XQuery Use Cases
[6] and the XQuery Test Suite [19]. The Use Cases collection consists of 85
“text-book” XQuery programs prepared to demonstrate the most important
features of the language, while the Test Suite collection contains 14869 small XQuery
programs created to cover all features (the remaining 252 files in the original
collection contain intentional parse errors). Although the Test Suite collection is
more than 100 times larger in terms of the number of files, the real ratio of sizes
(in terms of the number of AST nodes) is 31:1 because the Use Cases files are
larger.
In Tab. 1 we show the frequency of the core elements of the language, named
according to the abstract grammar nonterminals derived from the Normalized
XQuery Core Grammar (see Sec. 5.2). The percentages are defined as the number
of occurrences divided by the total number of abstract syntax tree nodes in the
collection (which was 4 469 for the Use Cases and 138 949 for the Test Suite).
Besides the obvious difference between the two collections, corresponding to
their purpose, there are the following noticeable observations: The frequency
of quantified expressions (some or every) is about eight times smaller than
the frequency of for-expressions. The if-expression is quite rare – once per 30
for-expressions or 50 operators. A number of features like ordered/unordered-
expressions are omitted in the Use Cases. While frequent in the Test Suite, the
comma operator is surprisingly rare in the Use Cases.
Table 2 shows the use of the twelve XPath axes. The percentages represent
the frequency of individual axes among all axis step operators in the collection
(which was 638 for the Use Cases and 6 623 for the Test Suite). Notice that the
results correspond to the traditional belief that many axes are extremely rare.
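The frequency figures in Tables 1 and 2 are simple ratios; a minimal sketch of how the per-element percentages could be recomputed over the XML form of the internal representation (same hypothetical layout as above) is:

```python
from collections import Counter
from lxml import etree

def element_frequencies(ir_documents):
    """Count each IR element name across a collection and express it as a
    percentage of all abstract syntax tree nodes (cf. Table 1)."""
    counts = Counter()
    for ir in ir_documents:
        root = etree.fromstring(ir)
        counts.update(node.tag for node in root.iter())
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}
```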
6 Conclusion
The main aim of this paper was to describe several research problems related
to a complex extensible framework for analyses of real-world XML data called
Analyzer. Firstly, we have proposed a correction framework dealing with invalid
nesting of elements in XML documents using the top-down recursive processing
of potentially invalid data trees. Contrary to existing approaches, we have con-
sidered the class of single-type tree grammars instead of only local tree grammars.
We are able to find all minimal repairs. Secondly, we have described and imple-
mented XQAnalyzer, a tool for the analysis of collections of XQuery programs.
XQAnalyzer works with a set of XQuery programs and translates them into an
intermediate code. Subsequently, analytical queries may be placed over these
translations to determine the presence or quantity of specific constructs.
In our future plans, we will focus on further improvements of the existing plugins
related to XML data analyses and their exploitation in a thorough analysis of
both the current state of real-world XML documents and the evolution of XML
data in the following months. We plan to repeat the analysis monthly and publish the
new as well as aggregated results on the Web. We believe that such a unique
analysis will provide the research community with important results useful for
both optimization purposes as well as development of brand new approaches.
Concurrently, we will shift our target area to the new types of data such as RDF
triples, linked data, ontologies etc.
Acknowledgement
This work was partially supported by the Czech Science Foundation (GAČR),
grants number 201/09/P364 and P202/10/0573.
{|
|+ Table 1. The elements of the internal representation
! Element !! Use Cases !! Test Suite !! Element !! Use Cases !! Test Suite
|-
| AtomicType || 0.27% || 2.49% || KindTest || 4.83% || 0.80%
|-
| Axis || 14.28% || 4.79% || LetClause || 1.25% || 0.32%
|-
| BaseURIDecl || — || 0.04% || Literal || 4.61% || 20.32%
|-
| BindingSequence || 3.62% || 1.11% || ModuleDecl || 0.07% || 0.00%
|-
| BoundarySpaceDecl || — || 0.07% || ModuleImport || 0.07% || 0.03%
|-
| CData || — || 0.01% || Name || 4.21% || 2.14%
|-
| CaseClauses || 0.02% || 0.03% || NameTest || 10.14% || 4.08%
|-
| CharRef || — || 0.02% || NamespaceDecl || 0.20% || 0.18%
|-
| CommaOperator || 0.04% || 2.04% || OperandExpression || 0.02% || 0.03%
|-
| ConstructionDecl || — || 0.04% || Operator || 3.85% || 8.43%
|-
| Constructor || 3.36% || 2.07% || OptionDecl || — || 0.01%
|-
| Content || 4.03% || 2.11% || OrderedExpr || — || 0.01%
|-
| ContextItem || 0.22% || 0.11% || OrderingModeDecl || — || 0.02%
|-
| CopyNamespacesDecl || — || 0.02% || Path || 10.02% || 2.51%
|-
| DefaultCase || 0.02% || 0.03% || PragmaList || — || 0.03%
|-
| DefaultCollationDecl || — || 0.01% || QuantifiedExpr || 0.27% || 0.15%
|-
| DefaultNamespaceDecl || — || 0.12% || QueryBody || 1.83% || 10.70%
|-
| ElseExpression || 0.07% || 0.08% || ReturnClause || 2.04% || 0.90%
|-
| EmptyOrderDecl || — || 0.03% || SchemaImport || 0.38% || 0.17%
|-
| EmptySequence || 0.02% || 0.63% || String || 6.82% || 2.12%
|-
| EntityRef || — || 0.01% || TestExpression || 0.34% || 0.23%
|-
| Extension || — || 0.03% || ThenExpression || 0.07% || 0.08%
|-
| FLWOR || 1.97% || 0.79% || TupleStream || 1.97% || 0.79%
|-
| ForClause || 2.10% || 0.59% || Type || 0.98% || 2.62%
|-
| FunctionBody || 0.40% || 0.16% || Typeswitch || 0.02% || 0.03%
|-
| FunctionCall || 5.77% || 17.11% || UnorderedExpr || — || 0.01%
|-
| FunctionDecl || 0.40% || 0.16% || ValidateExpr || — || 0.02%
|-
| Hint || 0.38% || 0.01% || VarDecl || — || 2.42%
|-
| IfExpr || 0.07% || 0.08% || VarRef || 8.68% || 3.47%
|-
| InClauses || 0.27% || 0.15% || VarValue || — || 2.42%
|}
{|
|+ Table 2. Axis usage
! Element !! Use Cases !! Test Suite !! Element !! Use Cases !! Test Suite
|-
| child || 71.63% || 82.67% || following || — || 0.44%
|-
| descendant || — || 0.21% || parent || — || 0.50%
|-
| attribute || 5.33% || 3.70% || ancestor || — || 0.44%
|-
| self || — || 0.36% || preceding-sibling || — || 0.42%
|-
| descendant-or-self || 23.04% || 10.40% || preceding || — || 0.42%
|-
| following-sibling || — || — || ancestor-or-self || — || 0.44%
|}
References
1. Afanasiev, L., Marx, M.: An analysis of xquery benchmarks. Inf. Syst. 33(2), 155–
181 (2008)
2. Bouchou, B., Cheriat, A., Ferrari Alves, M.H., Savary, A.: Integrating Correction
into Incremental Validation. In: BDA (2006)
3. Bex, G.J., Neven, F., den Bussche, J.V.: Dtds versus xml schema: a practical study.
In: WebDB ’04. pp. 79–84. ACM, New York, NY, USA (2004)
4. Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J.:
XQuery 1.0: An XML Query Language (Second Edition). W3C (December 2010),
http://www.w3.org/TR/xquery/
5. Bray, T., Paoli, J., Sperberg-McQueen, C.M., Maler, E., Yergeau, F.: Extensible
Markup Language (XML) 1.0 (Fifth Edition). W3C (November 2008), http://
www.w3.org/TR/xml/
6. Chamberlin, D., Fankhauser, P., Florescu, D., Marchiori, M., Robie, J.: XML Query
Use Cases. W3C (March 2007)
7. Corrector Prototype Implementation, http://www.ksi.mff.cuni.cz/~svoboda/
8. Draper, D., Fankhauser, P., Fernández, M., Malhotra, A., Rose, K., Rys, M.,
Siméon, J., Wadler, P.: XQuery 1.0 and XPath 2.0 Formal Semantics. W3C (Jan-
uary 2007)
9. Gao, S., Sperberg-McQueen, C.M., Thompson, H.S.: W3C XML Schema Definition
Language (XSD) 1.1 Part 1: Structures. W3C (December 2009), http://www.w3.
org/TR/xmlschema11-1/
10. Mignet, L., Barbosa, D., Veltri, P.: The xml web: a first study. In: Proceedings of
the 12th international conference on World Wide Web. pp. 500–510. WWW ’03,
ACM, New York, NY, USA (2003), http://doi.acm.org/10.1145/775152.775223
11. Mlynkova, I., Toman, K., Pokorny, J.: Statistical analysis of real xml data collec-
tions. In: COMAD’06. pp. 20–31. Tata McGraw-Hill Publishing, New Delhi, India
(2006)
12. Flesca, S., Furfaro, F., Greco, S., Zumpano, E.: Querying and Repairing Inconsistent
XML Data. In: WISE '05. LNCS, vol. 3806, pp. 175–188. Springer (2005)
13. Staworko, S., Chomicki, J.: Validity-Sensitive Querying of XML Databases. In: Cur-
rent Trends in Database Technology – EDBT 2006, DataX'06. Lecture Notes in
Computer Science, vol. 4254, pp. 164–177. Springer (2006)
14. Stárka, J., Svoboda, M., Sochna, J., Schejbal, J.: Analyzer 1.0. http://analyzer.
kenai.com/
15. Svoboda, M.: Processing of Incorrect XML Data. Master Thesis, Charles Univer-
sity in Prague, Czech Republic (September 2010), http://www.ksi.mff.cuni.cz/
~mlynkova/dp/Svoboda.pdf
16. Svoboda, M., Mlýnkova, I.: Correction of Invalid XML Documents with Respect
to Single Type Tree Grammars. In: NDT 2011. Communications in Computer and
Information Science, vol. 136. Springer, Macau, China (2011), [to be published]
17. Svoboda, M., Stárka, J., Sochna, J., Schejbal, J., Mlýnkova, I.: Analyzer: A frame-
work for file analysis. In: BenchmarX ’10. pp. 227–238. Springer-Verlag, Tsukuba,
Japan (2010), http://www.springerlink.com/content/078819t6645j6268/
18. Boobna, U., de Rougemont, M.: Correctors for XML Data. In: Database and XML
Technologies. LNCS, vol. 3186, pp. 69–96. Springer (2004)
19. W3C: XML Query Test Suite (November 2006)
20. Tan, Z., Zhang, Z., Wang, W., Shi, B.: Computing Repairs for Inconsistent XML
Document Using Chase. In: Advances in Data and Web Management. LNCS, vol.
4505, pp. 293–304. Springer (2007)