=Paper=
{{Paper
|id=Vol-101/paper-9
|storemode=property
|title=Robustness of External Annotation for WebPage Clipping: Empirical Evaluation with Evolving RealLife Web Documents
|pdfUrl=https://ceur-ws.org/Vol-101/Masahiro_Hori-et-al.pdf
|volume=Vol-101
}}
==Robustness of External Annotation for WebPage Clipping: Empirical Evaluation with Evolving RealLife Web Documents==
Robustness of External Annotation for Web-Page Clipping: Empirical Evaluation with Evolving Real-Life Web Documents
∗
Masahiro Hori, Faculty of Informatics, Kansai University, 2-1-1 Ryozenji-cho, Takatsuki-shi, Osaka 569-1095, Japan, horim@res.kutc.kansai-u.ac.jp
Mari Abe, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan, maria@jp.ibm.com
Kouichi Ono, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan, onono@jp.ibm.com
∗This author also belongs to Graduate School of Science and Technology, Keio University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2000 ACM 0-89791-88-6/97/05 ..$5.00

ABSTRACT
Web metadata is crucial for providing machine-understandable descriptions of Web resources, and has a number of applications such as discovery, qualification, and adaptation of Web documents. While annotations are often embedded into a Web document, annotations can also be associated externally by means of addressing expressions represented with the XPath language. However, there has been little empirical study of robust pointing using XPath expressions, in spite of the increasing prevalence of the XPath language not only for use with XSLT, but also in emerging content adaptation systems. The objective of this study is to evaluate the robustness of XPath expressions, especially the ones that can be used for the generation of external annotations, and draw practical implications to the reliable use of external annotation.

1. INTRODUCTION
Web annotation is crucial for providing not only human-readable remarks, but also machine-understandable descriptions, and has a number of applications such as discovery, qualification, and adaptation of Web contents [19]. As more and more Web-enabled personal devices are becoming available for connecting to the Internet, the same Web documents need to be rendered differently on different client devices. Adaptation of Web documents to the delivery context is thus crucial for transparent Web access, which may depend on client capabilities, network connectivity, or user preferences [8]. The long-term goal of our research is to establish technologies of customizing Web documents suitable for delivery context. The customization or document adaptation requires annotation that indicates the ways of modifying the document at hand.

Annotations can be embedded into a Web document as inline annotations, which are often created as extra attributes of document elements. Most existing HTML browsers ignore unknown attributes added to HTML elements, without being bothered by the proprietary inline annotations. Because of its simplicity, inline annotation has often been adopted as a way of associating annotation with HTML documents [22, 25, 10, 12]. An advantage of the inline approach is the ease of annotation maintenance without the bookkeeping task of associating annotations with their target document. The inline approach, however, requires annotators to have document ownership because annotated documents need to be modified whenever inline annotations are created or revised. On the other hand, the external annotation approach [13] does not suffer from these issues related to document ownership. The important point of the external annotation approach is that it facilitates the sharing and reuse of annotations across Web documents. In addition, the mixing of content and metadata is not desirable with regard to the design guideline that content should be separated from presentation. Therefore, it is assumed in this study that such metadata is maintained separately from a target document, and exploited dynamically at runtime by a content adaptation engine.

Since Web documents may change over time, it is not always obvious what kinds of addressing expression keep pointing to the same target element regardless of the document changes. It was reported that a key complaint in the use of electronic annotation was the situation in which an annotation cannot point to any portion of a target document [4]. These are aspects of the issue related to robust positioning, which has been investigated in a couple of empirical studies [24, 3]. However, there has been little empirical study of robust pointing using XPath expressions, in spite of the increasing prevalence of the XPath language not only for use with XSLT, but also in emerging content adaptation systems [13, 26, 23, 2].

The objective of this study is to evaluate the robustness of XPath expressions especially the ones that can be used
for the generation of external annotations, and draw practical implications to the reliable use of external annotation. In the next section, we introduce variations in annotation tools, on the basis of two authoring methods (annotation by selection and by example) as well as the different roles of annotations for assertion and transformation. Section 3 explains an annotation language for Web page clipping, which has been adopted for commercially available software products, and its application to page clipping for small-screen devices and portal site development. In Section 4, we present an empirical evaluation of the robustness of XPath expressions with regard to the changes in real-life HTML pages. In particular, it was investigated to what extent those expressions continued to point at the same nodes in the modified pages during the observation period of one year and three months. Finally, we discuss the advantages and limitations of the XPath expressions taking account of the generation of external annotations.

2. VARIATIONS IN ANNOTATION TOOLS
An annotation in general declares properties that qualify a particular portion of a target document. In some cases, however, annotations may indicate structural changes for the annotated portion of a target document. In order to clarify the distinction of these two roles, the former is called assertional annotation, while the latter is called transformational annotation [14]. Note here that this distinction is not exclusive, because every annotation is intrinsically an assertion.

Transformational annotation has been used for Web content adaptation, in which structural changes of a target document are needed [13, 23, 26, 28]. In contrast to assertional annotation languages (such as Dublin Core Metadata [7]), transformational annotation languages (such as XSLT [28]) are more like programming languages, and it is not necessarily easy for annotators to create transformational annotations by using the conventional annotation by selection approach.

It is simple for annotation authors or annotators to indicate a location to be annotated and create an assertion as annotation content. This is an approach that we call annotation by selection, and is adopted by existing annotation tools [1, 5, 10, 13, 17, 23, 11]. On the other hand, for transformational annotations, it is easier for annotators to modify a target document toward the desired results of the customization, rather than to indicate the ways of modifications declaratively as assertional annotations. This is a basic idea behind an approach that we call annotation by example, which was originally proposed in our previous work on the generation of XSLT rules [18].

According to the distinctions of annotation authoring methods and the roles of annotation, Table 1 summarizes variations in annotation tools. Annotators can select a portion of document to be annotated and declare properties on the selected portion as assertional annotation. This type of annotation tool supports assertional annotation by selection [Table 1(a)], and most of the existing annotation tools fall into this category. Even when annotations are used for structural changes of a target document, it is possible for authors to create transformational annotations by selecting portions to be changed and declaring instructions of transformations as annotations. This type of annotation tool supports transformational annotation by selection [Table 1(b)].

Table 1: Variations in annotation tool design
Roles of annotation | Authoring method: by selection | Authoring method: by example
Assertion | (a) | N/A
Transformation | (b) | (c)

In order to create transformational annotations, however, annotation by example would be much easier for annotators, because the annotators can work with a concrete example and create a desired result interactively with the example. In particular, the example-based method allows annotators to generate transformational annotations on the basis of annotators’ editing operations conducted to come up with a desired result. This type of annotation tool follows transformational annotation by example [Table 1(c)]. This example-based method is particularly useful for the transformational annotations, and would not make sense for assertional annotations, because it is not intuitive for annotators to indicate assertional annotations as results of structural changes of a target document.

The core part of the tool configuration is independent of any particular views and editors, and consists of two document object models (DOMs) [9]: one for a target document, and another for an annotation document. It is assumed here that the creation of an annotation document is a primary task of annotators, and the annotators are not allowed to modify a target document. The assertional annotation by selection is the most typical and provides a comprehensive way of annotation. The details of this type of annotation tool are reported in other articles [1, 16]. According to our interests in customization of Web documents, this paper focuses on the transformational annotation, and in particular the two advanced approaches to generating annotations by selection [Table 1(b)] and by example [Table 1(c)].

2.1 Transformational Annotation by Selection
Figure 1 depicts the configuration for transformational annotation by selection. This type of annotation tool relies on a target document viewer, because portions of a target document can only be selected without any modification. With this type of annotation tool, first an annotator opens a target document to be customized. The annotator then selects portions of the target document by using a document viewer [Figure 1 (a)], and indicates how each of the selected portions is to be modified (e.g., remove and enlarge). Transformational annotation can then be generated [Figure 1 (b)] on the basis of the portions of a target document selected by an annotator.

2.2 Transformational Annotation by Example
If a person knows how to perform a task to be executed by a computer, perhaps the person’s knowledge can somehow be exploited for the creation of a program to perform the task. This is the idea behind programming by example [20]. Programming by example is a natural approach to generating transformational annotation for page designers or novice programmers, because such users need only work with examples of how to transform a document at hand, and are given with generated annotations that can replicate the
same transformation.

Figure 1: Tool configuration for transformational annotation by selection

Figure 2: Tool configuration for transformational annotation by example

A configuration of the example-based annotation tool is depicted in Figure 2. This type of annotation tools relies on a target document editor rather than a viewer in contrast to the case of annotation by selection (see Figure 1). With this type of annotation tools, first an annotator opens a target document to be customized (e.g., an HTML file). The annotator then edits the document by using the full capabilities of a WYSIWYG authoring tool [Figure 2 (a)]. Although the annotator’s editing actions are recorded into an operation history [Figure 2 (b)], the annotator does not have to care about the recording process behind the scenes. When the editing is finished, the annotator will have a customized document. At the same time, the annotation generator creates transformational annotation for the document customization [Figure 2 (c)], which can be used by a runtime engine (e.g., XSLT processor) to replicate the transformation from the initial target document to the customized document. Further details on the annotation generation procedure are reported in the other articles [14, 18].

It is possible for both selection-based and example-based approaches to generate transformational annotations, but the selection-based approach is limited in the kinds of annotation constructs to be generated as compared with the example-based approach, because the expressiveness of annotators’ selection on a document viewer is far more limited than that of annotators’ full editing capability on a document editor.

3. ANNOTATION-BASED DOCUMENT ADAPTATION
Web pages for e-commerce, for example, contain a lot of information such as details of products, product images, and numerous links to other areas of the site, when the pages are created for the desktop computers. However, it may be necessary to deliver portions of this page for users to access through a Web-enabled phone rather than a desktop browser. In such a case, the images and nested HTML tables prepared for a nicely laid out page are a hindrance rather than help. The sheer amount of information becomes unwieldy in the small display, and potentially expensive depending on the user’s wireless service.

Transformational annotations provide additional information about Web documents, so that an adaptation engine can make better decisions on the content transformation. The role of annotations here is to provide explicit semantics that can be understood by a content adaptation engine [15].

Figure 3: Overview of an annotation-based transcoding

3.1 Page Clipping for Small-Screen Devices
An overview of an annotation-based transcoding process is depicted in Figure 3. Upon receipt of a request from a client, a Web document is retrieved from a content server. Taking account of the capabilities of the client specified in the HTTP request header, a transcoding proxy selects one or more transcoding modules. When a selected transcoding module requires an annotation document, an annotation file is also retrieved from a content server, which may or may not be the same server that retrieved the Web document. The transcoding module may simply return the original document, if a client agent has the rendering capabilities compatible with ordinary desktop computers [Figure 3 (a)]. Alternatively, the original document may be returned with modification, so that the original content can fit into a small screen device [Figure 3 (b)]. The decisions about the content adaptation are made taking account of the client capabilities specified in the HTTP request header.

Content adaptation can be done by using an annotation-based page-clipping engine [26]. At content delivery time, the page-clipping engine may modify the original document with reference to page-clipping annotations and client profiles sent over HTTP. The main idea in the page-clipping annotation language is the notion of a clipping state. By using and elements in the annotation descriptions, users can specify the clipping state to indicate whether the content being processed should be preserved or removed.

As a simple example, an HTML page and its clipped results are shown in Figure 4. In this example, the header and the first paragraph are preserved as shown in Figure 4(a). The table element is modified by deleting the third column and the second row. The cell-padding attribute of the table is increased, so that each table cell can be provided with
margin space [Figure 4(b)]. In addition, the whole of the second paragraph is removed as shown in Figure 4(c). All the structural changes in HTML documents can be easily done by using a WYSIWYG HTML editor.

Figure 4: Simple example of an HTML page clipping [original page and clipped page; (a) preserve, (b) modify, (c) remove]

Figure 5: Example of page-clipping annotations [surviving fragment of the listing: target="/HTML[1]/BODY[1]/*[1]">]

Figure 5 shows an annotation document that realizes the page clipping illustrated in Figure 4. This transformational annotation can actually be generated by using the example-based annotation generation tool [14]. The element prescribes a unit of an annotation statement in the annotation language. The target attribute is set to an XPath expression, and identifies the node on which the annotation will be applied, and the take-effect attribute indicates whether the annotation is applied before or after the target node. By specifying the value of target attribute as /HTML[1]/BODY[1]/*[1] [Figure 5(a)], the clipping state is activated after the first element after the first element, which in this case is an . The element in Figure 5(a) indicates that all the document elements encountered are preserved, until otherwise instructed by another annotation statement. The clipping state is changed to ’remove’ just before the second element [Figure 5(c)], and changed back to ’keep’ after the element [Figure 5(d)]. As results, the second paragraph element indicated by /HTML[1]/BODY[1]/P[2] is removed while preserving the elements just before and after the removed element.

Since HTML tables can often be complex elements to clip, the annotation language provides special-purpose elements to make table clipping easier. The and elements allow user to clip rows and columns without relying on complicated XPath expressions. The table-clipping elements are used in the description shown in Figure 5(b). This description sets the clipping state to ’keep’ just before the first table element, and also changes the value of element. The name attribute of can be specified with an arbitrary name of an attribute available for a target document.

In addition, the description element [Figure 5(b)] declares that the third column, which is indicated by the index value of the element, is discarded, while the remaining columns are preserved. Note here that the wildcard character to indicate multiple columns (index="*"). If a wildcard is specified, all rows (or columns) will be affected, except for those specifically indicated by a separate (or ) element. So, all rows but the second are preserved for the target table.

3.2 Page Clipping for Portal Site Development
Annotation-based page clipping is a useful technique for of small-screen devices, but the advantages are not limited to device adaptation. Another promising application of the page clipping technology is the use in Web portals. Web por- it can provide a single point of comprehensive, integrated access to both Web data and applications. However, each of the Web data or application is for the most cases provided assuming to be presented on a desktop browser, and would be too spacious to fit into a small area in a portal page. Page clipping is thus useful for Web pages that are aggregated into a portal site.

Figure 6: Creation of a portal page with annotation-based clipping portlet

Figure 6 illustrates the process of creating a portal page with an annotation-based clipping portlet. Portlets are specialized servlets that plug into and run in portals, and allow to generate dynamic contents. When a portal server receives
an HTTP request, the server dispatches the request to each portlet aggregated in the page, and collects the results into a portal page to be returned (Figure 6).

Figure 7: Annotation tool for Web clipping portlet

Figure 7 shows a screen of an annotation tool for clipping portlet on the left, and a portal page that includes the clipped page on the right. This annotation tool allows a user to select the portions of the original page to be removed in the portal page [cf., Figure 1 (a)], and the annotation generator creates page-clipping annotations from the selected nodes [cf., Figure 1 (b)].

The selection-based annotation generation was actually adopted for a software product of an annotation tool for a portal server, and extensively used in the development of a supplier portal of an automotive company. In this case, the automotive company extensively used the page-clipping portlet with the annotation tool solely for the simple and clipping operations. The primary reason for the customer’s choice was just the simplicity of the authoring process without advanced annotation constructs for page clipping. Since the automotive company needs to aggregate several thousands of existing pages into the portal site, it was not practical to create sophisticated clipping annotations page by page, and it was reasonable to provide just simple clipping capability to remove headers and side menus in the original documents that were created for browsers on desktop computers.

4. EMPIRICAL EVALUATION
The page-clipping annotation explained in the previous section is adopted for commercially available software products, and has been used for a number of real applications for Web document adaptation. Moreover, the above-mentioned annotation generation tools have been provided as toolkits for the software products: the example-based annotation tool for transcoding proxy [26] and the selection-based annotation tool for portal server development [6].

4.1 Single-Node Pointing Expression
External annotations generated by these two tools use a type of addressing expressions that points to at most one particular node, and will not point at anything if the particular node is not found. This type of addressing expressions is what we call single-node pointing expressions, which are suitable for automatic generation of addressing expressions, because single-node pointing expressions can be generated solely on the basis of focal nodes without human intervention.

Although the XPath language [27] provides thirteen axes (such as ancestor, descendant, following, and preceding) for specifying the direction of node-set selection in a location step, there exist only two axes, namely, the child and descendant axes, that can point to every element from a document root element, using only one kind of axis for every location step with position number predicate. Therefore, there are only two kinds of single-node pointing expressions created by using only one axis. One is to create an expression pointing to an element of the target document using a sequence of child-position location steps (ChildPosSeq). Another is for pointing to an element by means of a descendant position (DescendantPos).

The ChildPosSeq expression simply follows the hierarchy of DOM tree from the root to a target element, and points to at most one element by a sequence of child positions (e.g., /html[1]/body[1]/table[2]/tbody[1]/tr[1]/td[2]). The DescendantPos expression, on the other hand, indicates the number of an element with the same tag name in the document order among all the descendant nodes from the root node, and points to at most one element by a descendant position (e.g., /descendant::table[8]).

The two annotation tools mentioned above have been using the ChildPosSeq expression, and the DescendantPos was not used for the generation of addressing expressions. In the remainder of this section, we present an empirical evaluation of the robustness of XPath expressions, in order to draw practical implications to the reliable use of external annotation.

4.2 Evaluation Method and Results
Table 2 shows the basic data of the observed HTML pages. The pages A and B are a corporate top page and a product page of the same company. The page C is a top page of a news media company, while the page D belongs to a software company. These pages were saved each day during about the
period of one year and three months, and 540 pages for each were collected as samples for this investigation.

Table 2: Basic data of the observed HTML pages
Page | URI | Number of nodes per page [ave. (max, min)] | Depth of document tree [ave. (max, min)]
A | www.ibm.com/ | 393 (441, 348) | 21 (21, 20)
B | www.ibm.com/products/ | 709 (758, 623) | 27 (30, 20)
C | public.wsj.com/ | 952 (1333, 433) | 22 (24, 21)
D | java.sun.com/ | 909 (1311, 325) | 22 (29, 15)
The number of sample pages is 540 for each reference page.

In order to make sure that an XPath expression actually points to the same node in a reference page, it is necessary to know the node-to-node mapping between the reference page and other sample pages with the same URI that are collected during the observation period. For the purpose of this empirical study, an ID attribute was added to every DOM node of the reference and sample pages. Note that most existing HTML pages are not well-formed, and therefore cannot be reliably pointed to by XPath expressions. It is assumed here that both the reference page and sample pages are parsed by an HTML parser in advance, and converted to DOM trees before the node ID insertion process.

Figure 8: Process of node ID insertion [(a) inserting IDs into the reference page; (b) differencing the reference page against sample pages 1 .. N and patching IDs into the sample pages]

Figure 8 illustrates the process of node ID insertion, which can be done in the following two steps. The first step is to add unique identifiers as an attribute value (e.g., uid="N27") to every node in the reference page [Figure 8 (a)]. Note that comment and text nodes were excluded from the ID insertion, because an attribute cannot be added to those nodes. The second step consists of differencing and patching [Figure 8 (b)]. The DOM-tree difference was calculated taking account of the changes in each sample page as compared with a reference page with the same URL. Each difference file consists of a sequence of edit operations that transform the reference page into a corresponding sample page. The difference files are then applied to the reference page with unique identifiers. Finally, as results of the patching, we can obtain a set of sample pages with unique identifiers. As a differencing and merging tool for XML documents, we used the 3DM tool [21].

It is straightforward to insert ID attributes to all the DOM nodes in the reference pages. However, due to the changes in Web pages over time, some nodes will be inserted into or deleted from the documents. The newly inserted nodes do not have any ID attribute, because they cannot be mapped from any node in the reference page. In addition, like most tree differencing and merging tools, the 3DM algorithm regards an attribute-value change as deletion of the modified node followed by insertion of the same node with updated attribute value. This means the ID attributes disappear when a node is updated as results of an attribute-value change.

Figure 9: Number of nodes with ID attributes during the observation period [x-axis: number of days; y-axis: number of nodes with ID attribute; one curve for each of the pages A to D]

Figure 9 shows the numbers of nodes with ID attributes for each sample page along the 540-day observation period. For the pages A and B, there is a sudden decrease in the number of nodes with IDs just after the 50th day due to the small style changes. In addition, since the site design for the pages C and D has been changed, the numbers of nodes with IDs were suddenly decreased in the page C after the 232nd day and the page D after the 367th day. The numbers in the page D temporarily decreased for about 10 days from the 272nd day. This is because the site design was temporarily changed due to the annual developer conference¹ held by that company.

The number of nodes with IDs decreases as results of the deletion, insertion, and update of DOM nodes. Note here that the differencing algorithm was always applied to obtain node-to-node mappings from a reference page to a sample page. Therefore, the difference from the reference page would not necessarily increase monotonically, but may be reduced later if the page were modified again to be closer to the reference page. This is why the number of nodes with IDs increases at some points in the observed period.

Figure 10: Process of testing XPath expressions [(a) a test set of XPath expressions created from the reference page with IDs; (b) the test set applied to the sample pages with IDs 1 .. N]

¹ The 2002 JavaOne Conference
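The first step of the node ID insertion described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the tooling used in the study: the function names `insert_node_ids` and `ids_in` are hypothetical, the pages are assumed to be already converted to well-formed markup, and the "patched" sample page is written by hand here instead of being produced by a differencing tool such as 3DM.

```python
import xml.etree.ElementTree as ET


def insert_node_ids(root, attr="uid", prefix="N"):
    """Give every element node a unique ID attribute, in document order."""
    count = 0
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # comments and processing instructions cannot carry attributes
        count += 1
        element.set(attr, f"{prefix}{count}")
    return count


def ids_in(root, attr="uid"):
    """Collect the ID values that are present in a page."""
    return {e.get(attr) for e in root.iter()
            if isinstance(e.tag, str) and e.get(attr) is not None}


# Reference page: text nodes are left untouched, elements get uid="N1" .. "N5".
reference = ET.fromstring(
    "<html><body><h1>News</h1><p>First</p><p>Second</p></body></html>")
total = insert_node_ids(reference)

# Hand-written stand-in for a patched sample page (an assumption of this
# sketch): the first paragraph was deleted and a new <div> was inserted,
# so the <div> carries no ID.
sample = ET.fromstring(
    '<html uid="N1"><body uid="N2"><div>Ad</div>'
    '<h1 uid="N3">News</h1><p uid="N5">Second</p></body></html>')

surviving = ids_in(reference) & ids_in(sample)
print(total, sorted(surviving))
```

Counting the surviving IDs per sample page in this way corresponds to the quantity plotted in Figure 9: deleted nodes lose their IDs, and newly inserted nodes never receive one.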
Figure 11: Percentage of correctly pointed nodes for each XPath expression [y-axis: percentage of correct pointing; series: ChildPosSeq, DescendantPos]
Figure 12: Number of nodes with detailed distinction of pointing status [x-axis: number of nodes; rows: IDValueMatch, ChildPosSeq, DescendantPos; categories: Exact, Exclusive, Nonexistent]
attribute we created in the ID-insertion process (Figure 8).
Every DOM node with a unique identifier in a reference The IDValueMatch corresponds to a situation when every
page was regarded as a reference node. For each reference node of a document at hand is provided with an unique
node, we created ChildPosSeq and DescendantPos expres- identifier. Although the attribute name “uid” might be ac-
sions, so that they can point to the corresponding reference cidentally duplicated in some sample pages, no duplication
node. This set of XPath expressions constituted a test set was found within the pages examined for this study.
[Figure 10 (a)]. Each expression in the test set was then Besides the numbers of the exact (correct) pointing, one
applied to sample pages with the same URI [Figure 10 (b)]. of the notable feature in Figure 12 is the large number of
If an XPath expression actually points to a node with the the exclusive pointing errors for the DescendantPos. An
same ID as the reference node, the expression was regarded exclusive pointing error means that an XPath expression
as correct in the sample page. Note that an XPath ex- points to a node or a set of nodes that does not include the
pression may point to multiple nodes in a target document. target node at all.
However, since this evaluation only deals with single-node If we consider only exact pointing, this example is not
pointing expressions, correct expressions must point to the so serious as simply a case of incorrect pointing. However,
only one node with the same ID in a sample page, and must when we use XPath expressions for external annotations,
not point to any other nodes in the sample page. we need to further think about the actual influences of such
Figure 11 shows the percentages of nodes that are cor- incorrect pointing with regard to the behavior of the run-
rectly pointed to by each type of expressions. The Child- time engines such as XSLT processors for XSLT stylesheets.
PosSeq always scores higher percentage than DescendantPos Therefore, it is also important to investigate the robustness
expression. In the next section, we will further investigate of XPath expressions taking account of application scenar-
the robustness of the two types of expressions. ios.
5. DISCUSSION 6. CONCLUDING REMARKS
A correctly pointing expression points to a target node In this paper, we presented variations in annotation tools,
that is identical with the reference target node as identi- and explained the two types of tools that generate transfor-
fied by the ID-attribute value, and does not points to any mational annotation for Web document clipping. Since the
other nodes other than the reference target. Here we call transformational annotations are descriptions of the ways of
this correct pointing as exact pointing. When an XPath ex- modifying the document at hand, the semantics of the docu-
pression does not correctly point to a target node, the ways ment adaptation can easily be indicated through annotator’s
of incorrect pointing can be categorized into three types: demonstration or editing actions to obtain the desired result
nonexistent, inclusive, and exclusive. Nonexistent pointing of adaptation. Although the example-based annotation tool
is when an XPath expression points to nothing. Inclusive is the most sophisticated approach to creating transforma-
pointing is when an expression points to a node set that in- tional annotation, it may suffer from difficulties in inferring
cludes not only the target node but also nodes other than the annotator’s intension behind the editing process. On
the target. Exclusive pointing is when an expression points the other hand, the selection-based annotation is a simpler
to a node or a set of nodes that does not include the target approach, and limited in the capabilities of annotation gen-
node at all. eration. However, it is noteworthy that the simplicity was a
Figure 12 divides the nodes up by pointing status. The to- real advantage for use in the development of clipping portlet,
tal number of tested nodes was more than a million (1,280,880), because it was not practical to create sophisticated clipping
counting all the HTML elements included in the sample annotations for several thousands of pages to be aggregated
pages throughout the entire observation period. The IDVal- into a supplier portal.
ueMatch expressions rely on the ID-attribute value of the Finally, the empirical study presented in this paper is
target node. An example of the IDValueMatch expression tightly bound to the specific HTML pages we investigated,
is //*[@uid=’N35’], where the “uid” is the name of the ID and we know that our sample sizes are too small to have
statistical validity. Therefore, we are not claiming that the results can easily be generalized to all the other kinds of HTML documents. However, this empirical study is an important step towards establishing hypotheses regarding phenomena that may hinder the practical use of external annotations or metadata that exploit XPath expressions.

7. REFERENCES

[1] Abe, M. and Hori, M.: A visual approach to authoring XPath expressions. Proceedings of Extreme Markup Languages 2001, pp. 1–14, Montréal, Canada (2001).
[2] Asakawa, C. and Takagi, H.: Transcoding system for non-visual Web access (2): annotation-based transcoding. Sixteenth International Conference on Technologies and Persons with Disabilities (CSUN2001) (2001).
[3] Brush, A. J., Bargeron, D., Gupta, A., and Cadiz, J. J.: Robust annotation positioning in digital documents. Proceedings of the 2001 ACM Conference on Human Factors in Computing Systems (CHI 2001), pp. 285–292, Seattle, Washington (2001).
[4] Cadiz, J. J., Gupta, A., and Grudin, J.: Using Web annotations for asynchronous collaboration around documents. Proceedings of ACM 2000 Conference on Computer Supported Cooperative Work (CSCW 2000), pp. 309–318, Philadelphia, PA (2000).
[5] Denoue, L. and Vignollet, L.: An annotation tool for Web browsers and its applications to information retrieval. Proceedings of the 6th Conference on Content-Based Multimedia Information Access (RIAO 2000), Paris, France (2000).
[6] DeWitt, S.: Basic Web Clipping Using WebSphere Portal Version 4.1. IBM WebSphere Developer Domain, http://www7b.software.ibm.com/wsdd/library/techarticles/0206 dewitt/dewitt.html (2002).
[7] Dublin Core Metadata Element Set, Version 1.1: Reference Description. Dublin Core Metadata Initiative, Recommendation, http://dublincore.org/documents/dces/ (1999).
[8] Device Independence Principles. W3C Working Draft, http://www.w3.org/TR/di-princ/ (2001).
[9] Document Object Model (DOM) Level 1 Specification Version 1.0. W3C Recommendation, http://www.w3.org/TR/REC-DOM-Level-1/ (1998).
[10] Erdmann, M., Maedche, A., Schnurr, H.-P., and Staab, S.: From manual to semi-automatic semantic annotation: about ontology-based text annotation tools. Proceedings of the COLING 2000 Workshop on Semantic Annotation and Intelligent Content, Luxembourg (2000).
[11] Handschuh, S. and Staab, S.: Authoring and annotation of Web pages in CREAM. Proceedings of the 11th International World Wide Web Conference, pp. 462–473, Honolulu, Hawaii (2002).
[12] Heflin, J. and Hendler, J.: Semantic interoperability on the Web. Proceedings of Extreme Markup Languages 2000, pp. 111–120 (2000).
[13] Hori, M., Kondo, G., Ono, K., Hirose, S., and Singhal, S.: Annotation-based Web content transcoding. Proceedings of the 9th International World Wide Web Conference, pp. 197–211, Amsterdam, Netherlands (2000).
[14] Hori, M., Ono, K., Koyanagi, T., and Abe, M.: Annotation by transformation for the automatic generation of content customization metadata. In F. Mattern and M. Naghshineh (Eds.), Pervasive Computing, First International Conference, Pervasive 2002, Lecture Notes in Computer Science 2414, pp. 267–281, Zurich, Switzerland (2002).
[15] Hori, M.: Semantic annotation for Web content adaptation. In D. Fensel, J. Hendler, H. Lieberman, and W. Wahlster (Eds.), Spinning the Semantic Web, pp. 542–573, MIT Press, Boston, MA (2002).
[16] Hori, M., Abe, M., and Ono, K.: Extensible framework of authoring tools for Web document annotation. Proceedings of International Workshop on Semantic Web Foundations and Application Technologies (SWFAT), pp. 1–8, Nara, Japan (2003).
[17] Kahan, J. and Koivunen, M.-R.: Annotea: an open RDF infrastructure for shared Web annotations. Proceedings of the 10th International World Wide Web Conference, pp. 623–632, Hong Kong (2001).
[18] Koyanagi, T., Ono, K., and Hori, M.: Demonstrational Interface for XSLT Stylesheet Generation. Markup Languages: Theory & Practice, 2(2): 133–152 (2001).
[19] Lassila, O.: Web metadata: a matter of semantics. IEEE Internet Computing, 2(4): 30–37 (1998).
[20] Lieberman, H. (Ed.): Your Wish is My Command: Programming by example. Morgan Kaufmann Publishers, San Francisco (2001).
[21] Lindholm, T.: A 3-way merging algorithm for synchronizing ordered trees – The 3DM merging and differencing tool for XML. Master Thesis, Department of Computer Science, Helsinki University of Technology (2001).
[22] Mea, V. D., Beltrami, C. A., Roberto, V., and Brunato, D.: HTML generation and semantic markup for telepathology. Proceedings of the 5th International World Wide Web Conference, pp. 1085–1094, Paris, France (1996).
[23] Nagao, K., Shirai, Y., and Kevin, S.: Semantic annotation and transcoding: making Web content more accessible. IEEE Multimedia, 8(2): 69–81 (2001).
[24] Phelps, T. A. and Wilensky, R.: Robust intra-document locations. Proceedings of the 9th International World Wide Web Conference, pp. 105–118, Amsterdam, Netherlands (2000).
[25] Rousseau, J. F., Macias, A. G., de Lima, J. V., and Duda, A.: User adaptable multimedia presentations for the World Wide Web. Proceedings of the 8th International World Wide Web Conference, pp. 195–212, Toronto, Canada (1999).
[26] Spinks, R., Topol, B., Seekamp, C., and Ims, S.: Document clipping with annotation. IBM developerWorks, http://www.ibm.com/developerworks/ibm/library/ibm-clip/ (2001).
[27] XML Path Language (XPath) Version 1.0. W3C Recommendation, http://www.w3.org/TR/xpath (1999).
[28] XSL Transformations (XSLT) Version 1.0. W3C Recommendation, http://www.w3.org/TR/xslt (1999).