=Paper= {{Paper |id=Vol-101/paper-9 |storemode=property |title=Robustness of External Annotation for WebPage Clipping: Empirical Evaluation with Evolving RealLife Web Documents |pdfUrl=https://ceur-ws.org/Vol-101/Masahiro_Hori-et-al.pdf |volume=Vol-101 }} ==Robustness of External Annotation for WebPage Clipping: Empirical Evaluation with Evolving RealLife Web Documents== https://ceur-ws.org/Vol-101/Masahiro_Hori-et-al.pdf
Robustness of External Annotation for Web-Page Clipping:
      Empirical Evaluation with Evolving Real-Life
                    Web Documents
Masahiro Hori
Faculty of Informatics, Kansai University
2-1-1 Ryozenji-cho, Takatsuki-shi, Osaka 569-1095, Japan
horim@res.kutc.kansai-u.ac.jp

Mari Abe∗
IBM Tokyo Research Laboratory
1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
maria@jp.ibm.com

Kouichi Ono
IBM Tokyo Research Laboratory
1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
onono@jp.ibm.com


∗This author also belongs to Graduate School of Science and Technology, Keio University.

Permission to make digital or hard copies of all or part of this work for personal or classroom
use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
Copyright 2000 ACM 0-89791-88-6/97/05 ..$5.00

ABSTRACT
Web metadata is crucial for providing machine-understandable descriptions of Web resources,
and has a number of applications such as discovery, qualification, and adaptation of Web
documents. While annotations are often embedded into a Web document, annotations can also be
associated externally by means of addressing expressions represented with the XPath language.
However, there has been little empirical study of robust pointing using XPath expressions, in
spite of the increasing prevalence of the XPath language not only for use with XSLT, but also
in emerging content adaptation systems. The objective of this study is to evaluate the
robustness of XPath expressions, especially the ones that can be used for the generation of
external annotations, and to draw practical implications for the reliable use of external
annotation.

1.     INTRODUCTION
   Web annotation is crucial for providing not only human-readable remarks, but also
machine-understandable descriptions, and has a number of applications such as discovery,
qualification, and adaptation of Web contents [19]. As more and more Web-enabled personal
devices are becoming available for connecting to the Internet, the same Web documents need to
be rendered differently on different client devices. Adaptation of Web documents to the
delivery context is thus crucial for transparent Web access, which may depend on client
capabilities, network connectivity, or user preferences [8]. The long-term goal of our research
is to establish technologies for customizing Web documents to suit the delivery context. The
customization or document adaptation requires annotation that indicates the ways of modifying
the document at hand.
   Annotations can be embedded into a Web document as inline annotations, which are often
created as extra attributes of document elements. Most existing HTML browsers ignore unknown
attributes added to HTML elements, without being bothered by the proprietary inline
annotations. Because of its simplicity, inline annotation has often been adopted as a way of
associating annotation with HTML documents [22, 25, 10, 12]. An advantage of the inline
approach is the ease of annotation maintenance without the bookkeeping task of associating
annotations with their target document. The inline approach, however, requires annotators to
have document ownership because annotated documents need to be modified whenever inline
annotations are created or revised.
   On the other hand, the external annotation approach [13] does not suffer from these issues
related to document ownership. The important point of the external annotation approach is that
it facilitates the sharing and reuse of annotations across Web documents. In addition, the
mixing of content and metadata is not desirable with regard to the design guideline that
content should be separated from presentation. Therefore, it is assumed in this study that such
metadata is maintained separately from a target document, and exploited dynamically at runtime
by a content adaptation engine.
   Since Web documents may change over time, it is not always obvious what kinds of addressing
expression keep pointing at the same target element regardless of the document changes. It was
reported that a key complaint in the use of electronic annotation was the situation in which an
annotation cannot point to any portion of a target document [4]. These are aspects of the issue
related to robust positioning, which has been investigated in a couple of empirical studies
[24, 3]. However, there has been little empirical study of robust pointing using XPath
expressions, in spite of the increasing prevalence of the XPath language not only for use with
XSLT, but also in emerging content adaptation systems [13, 26, 23, 2].
   The objective of this study is to evaluate the robustness
of XPath expressions, especially the ones that can be used for the generation of external
annotations, and draw practical implications for the reliable use of external annotation.
In the next section, we introduce variations in annotation tools, on the basis of two authoring
methods (annotation by selection and by example) as well as the different roles of annotations
for assertion and transformation. Section 3 explains an annotation language for Web page
clipping, which has been adopted for commercially available software products, and its
application to page clipping for small-screen devices and portal site development. In Section 4,
we present an empirical evaluation of the robustness of XPath expressions with regard to the
changes in real-life HTML pages. In particular, it was investigated to what extent those
expressions continued to point at the same nodes in the modified pages during the observation
period of one year and three months. Finally, we discuss the advantages and limitations of the
XPath expressions taking account of the generation of external annotations.

2.   VARIATIONS IN ANNOTATION TOOLS
   An annotation in general declares properties that qualify a particular portion of a target
document. In some cases, however, annotations may indicate structural changes for the annotated
portion of a target document. In order to clarify the distinction between these two roles, the
former is called assertional annotation, while the latter is called transformational annotation
[14]. Note here that this distinction is not exclusive, because every annotation is
intrinsically an assertion.
   Transformational annotation has been used for Web content adaptation, in which structural
changes of a target document are needed [13, 23, 26, 28]. In contrast to assertional annotation
languages (such as Dublin Core Metadata [7]), transformational annotation languages (such as
XSLT [28]) are more like programming languages, and it is not necessarily easy for annotators
to create transformational annotations by using the conventional annotation-by-selection
approach.
   It is simple for annotation authors or annotators to indicate a location to be annotated and
create an assertion as annotation content. This is an approach that we call annotation by
selection, and is adopted by existing annotation tools [1, 5, 10, 13, 17, 23, 11]. On the other
hand, for transformational annotations, it is easier for annotators to modify a target document
toward the desired results of the customization, rather than to indicate the ways of
modifications declaratively as assertional annotations. This is a basic idea behind an approach
that we call annotation by example, which was originally proposed in our previous work on the
generation of XSLT rules [18].
   According to the distinctions of annotation authoring methods and the roles of annotation,
Table 1 summarizes variations in annotation tools. Annotators can select a portion of a
document to be annotated and declare properties on the selected portion as assertional
annotation. This type of annotation tool supports assertional annotation by selection
[Table 1(a)], and most of the existing annotation tools fall into this category. Even when
annotations are used for structural changes of a target document, it is possible for authors to
create transformational annotations by selecting portions to be changed and declaring
instructions of transformations as annotations. This type of annotation tool supports
transformational annotation by selection [Table 1(b)].

Table 1: Variations in annotation tool design

                              Authoring methods
Roles of annotation       By selection    By example
Assertion                     (a)             N/A
Transformation                (b)             (c)

   In order to create transformational annotations, however, annotation by example would be
much easier for annotators, because the annotators can work with a concrete example and create
a desired result interactively with the example. In particular, the example-based method allows
annotators to generate transformational annotations on the basis of annotators’ editing
operations conducted to come up with a desired result. This type of annotation tool follows
transformational annotation by example [Table 1(c)]. This example-based method is particularly
useful for the transformational annotations, and would not make sense for assertional
annotations, because it is not intuitive for annotators to indicate assertional annotations as
results of structural changes of a target document.
   The core part of the tool configuration is independent of any particular views and editors,
and consists of two document object models (DOMs) [9]: one for a target document, and another
for an annotation document. It is assumed here that the creation of an annotation document is a
primary task of annotators, and the annotators are not allowed to modify a target document. The
assertional annotation by selection is the most typical and provides a comprehensive way of
annotation. The details of this type of annotation tool are reported in other articles [1, 16].
According to our interests in customization of Web documents, this paper focuses on the
transformational annotation, and in particular the two advanced approaches to generating
annotations by selection [Table 1(b)] and by example [Table 1(c)].

2.1     Transformational Annotation by Selection
   Figure 1 depicts the configuration for transformational annotation by selection. This type of
annotation tool relies on a target document viewer, because portions of a target document can
only be selected without any modification. With this type of annotation tool, first an
annotator opens a target document to be customized. The annotator then selects portions of the
target document by using a document viewer [Figure 1 (a)], and indicates how each of the
selected portions is to be modified (e.g., remove and enlarge). Transformational annotation can
then be generated [Figure 1 (b)] on the basis of the portions of a target document selected by
an annotator.

2.2     Transformational Annotation by Example
   If a person knows how to perform a task to be executed by a computer, perhaps the person’s
knowledge can somehow be exploited for the creation of a program to perform the task. This is
the idea behind programming by example [20]. Programming by example is a natural approach to
generating transformational annotation for page designers or novice programmers, because such
users need only work with examples of how to transform a document at hand, and are given
generated annotations that can replicate the
same transformation.

Figure 1: Tool configuration for transformational annotation by selection
[Diagram labels: core components (Target Document, Transformational Annotation);
selection-based annotation generation (Target Document Viewer (a), Selected nodes,
Annotation Generator (b)).]

Figure 2: Tool configuration for transformational annotation by example
[Diagram labels: core components (Target Document, Transformational Annotation);
example-based annotation generation (Target Document Editor (a), Customized Document,
Operation history (b), Annotation Generator (c)).]

   A configuration of the example-based annotation tool is depicted in Figure 2. This type of
annotation tool relies on a target document editor rather than a viewer, in contrast to the
case of annotation by selection (see Figure 1). With this type of annotation tool, first an
annotator opens a target document to be customized (e.g., an HTML file). The annotator then
edits the document by using the full capabilities of a WYSIWYG authoring tool [Figure 2 (a)].
Although the annotator’s editing actions are recorded into an operation history [Figure 2 (b)],
the annotator does not have to care about the recording process behind the scenes. When the
editing is finished, the annotator will have a customized document. At the same time, the
annotation generator creates transformational annotation for the document customization
[Figure 2 (c)], which can be used by a runtime engine (e.g., XSLT processor) to replicate the
transformation from the initial target document to the customized document. Further details on
the annotation generation procedure are reported in other articles [14, 18].
   It is possible for both selection-based and example-based approaches to generate
transformational annotations, but the selection-based approach is limited in the kinds of
annotation constructs to be generated as compared with the example-based approach, because the
expressiveness of annotators’ selection on a document viewer is far more limited than that of
annotators’ full editing capability on a document editor.

3.   ANNOTATION-BASED DOCUMENT ADAPTATION
   Web pages for e-commerce, for example, contain a lot of information such as details of
products, product images, and numerous links to other areas of the site, when the pages are
created for desktop computers. However, it may be necessary to deliver portions of such a page
for users to access through a Web-enabled phone rather than a desktop browser. In such a case,
the images and nested HTML tables prepared for a nicely laid out page are a hindrance rather
than a help. The sheer amount of information becomes unwieldy in the small display, and
potentially expensive depending on the user’s wireless service.
   Transformational annotations provide additional information about Web documents, so that an
adaptation engine can make better decisions on the content transformation. The role of
annotations here is to provide explicit semantics that can be understood by a content
adaptation engine [15].

3.1    Page Clipping for Small-Screen Devices
   An overview of an annotation-based transcoding process is depicted in Figure 3. Upon receipt
of a request from a client, a Web document is retrieved from a content server. Taking account
of the capabilities of the client specified in the HTTP request header, a transcoding proxy
selects one or more transcoding modules. When a selected transcoding module requires an
annotation document, an annotation file is also retrieved from a content server, which may or
may not be the same server that provided the Web document. The transcoding module may simply
return the original document, if a client agent has the rendering capabilities compatible with
ordinary desktop computers [Figure 3 (a)]. Alternatively, the original document may be returned
with modification, so that the original content can fit into a small-screen device
[Figure 3 (b)]. The decisions about the content adaptation are made taking account of the
client capabilities specified in the HTTP request header.

Figure 3: Overview of an annotation-based transcoding
[Diagram labels: Original page (a), Clipped page (b), Transcoding Proxy, Clipping engine,
HTTP Server, Target document (URL-x), Annotation document.]

   Content adaptation can be done by using an annotation-based page-clipping engine [26]. At
content delivery time, the page-clipping engine may modify the original document with reference
to page-clipping annotations and client profiles sent over HTTP. The main idea in the
page-clipping annotation language is the notion of a clipping state. By using dedicated
elements in the annotation descriptions, users can specify the clipping state to indicate
whether the content being processed should be preserved or removed.
   As a simple example, an HTML page and its clipped results are shown in Figure 4. In this
example, the header and the first paragraph are preserved as shown in Figure 4(a). The table
element is modified by deleting the third column and the second row. The cell-padding attribute
of the table is increased, so that each table cell can be provided with
        Original page                              C lipped page   ument elements encountered are preserved, until otherwise
                                                                   instructed by another annotation statement. The clipping
                                                                   state is changed to ’remove’ just before the second 

ele- (a) pres erv e ment [Figure 5(c)], and changed back to ’keep’ after the

element [Figure 5(d)]. As results, the second paragraph ele- ment indicated by /HTML[1]/BODY[1]/P[2] is removed while preserving the elements just before and after the re- (b ) m o d if y moved element. Since HTML tables can often be complex elements to clip, (c ) rem o v e the annotation language provides special-purpose elements to make table clipping easier. The and el- ements allow user to clip rows and columns without relying Figure 4: Simple example of an HTML page clipping on complicated XPath expressions. The table-clipping el- ements are used in the description shown in Figure 5(b). This description sets the clipping state to ’keep’ just be- fore the first table element, and also changes the value of target="/HTML[1]/BODY[1]/*[1]"> element. The name attribute of can be specified with an arbitrary name of an attribute available for a target document. In addition, the description element [Figure 5(b)] declares that the third column, which is indicated by the index value of the element, is discarded, while the remaining columns are preserved. Note here that the wildcard char- acter to indicate multiple columns (index="*"). If a wild- card is specified, all rows (or columns) will be affected, ex- cept for those specifically indicated by a separate (or ) element. So, all rows but the second are pre- served for the target table.
3.2 Page Clipping for Portal Site Development Annotation-based page clipping is a useful technique for of small-screen devices, but the advantages are not limited to device adaptation. Another promising application of the page clipping technology is the use in Web portals. Web por- it can provide a single point of comprehensive, integrated access to both Web data and applications. However, each
of the Web data or application is for the most cases pro- vided assuming to be presented on a desktop browser, and would be too spacious to fit into a small area in a portal Figure 5: Example of page-clipping annotations page. Page clipping is thus useful for Web pages that are aggregated into a portal site. Figure 6 illustrates the process of creating a portal page margin space [Figure 4(b)]. In addition, the whole of the with an annotation-based clipping portlet. Portlets are spe- second paragraph is removed as shown in Figure 4(c). All cialized servlets that plug into and run in portals, and allow the structural changes in HTML documents can be easily to generate dynamic contents. When a portal server receives done by using a WYSIWYG HTML editor. Figure 5 shows an annotation document that realizes the page clipping illustrated in Figure 4. This transformational H T T P P orta l P a ge re q u e s t P orta l Server annotation can actually be generated by using the example- Content Server A based annotation generation tool [14]. The Clipped Clipping P P agag e e Annotation Annotation element prescribes a unit of an annotation statement in pa ge portlet ( ( a)a) d d ococ u u m m e e nt nt the annotation language. The target attribute is set to an XPath expression, and identifies the node on which the O th er Content Server B O th er annotation will be applied, and the take-effect attribute pa ge portlet P P agag e e indicates whether the annotation is applied before or af- ((bb )) ter the target node. By specifying the value of target at- H T T P r e s p ons e tribute as /HTML[1]/BODY[1]/*[1] [Figure 5(a)], the clipping state is activated after the first element after the first element, which in this case is an

. The element in Figure 5(a) indicates that all the doc-

Figure 6: Creation of a portal page with annotation-based clipping portlet

Figure 7: Annotation tool for Web clipping portlet (panel labels: Annotation Tool; Portal Page on Browser; Annotation by selection; Adaptation by transformational annotation)

an HTTP request, the server dispatches the request to each portlet aggregated in the page, and collects the results into a portal page to be returned (Figure 6).

Figure 7 shows a screen of the annotation tool for the clipping portlet on the left, and a portal page that includes the clipped page on the right. This annotation tool allows a user to select the portions of the original page to be removed in the portal page [cf. Figure 1 (a)], and the annotation generator creates page-clipping annotations from the selected nodes [cf. Figure 1 (b)].

The selection-based annotation generation was adopted in a software product, an annotation tool for a portal server, and was used extensively in the development of a supplier portal for an automotive company. In this case, the automotive company used the page-clipping portlet with the annotation tool solely for simple clipping operations. The primary reason for the customer's choice was the simplicity of the authoring process, which requires no advanced annotation constructs for page clipping. Since the automotive company needed to aggregate several thousand existing pages into the portal site, it was not practical to create sophisticated clipping annotations page by page, and it was reasonable to provide just a simple clipping capability that removes headers and side menus from original documents created for browsers on desktop computers.

4. EMPIRICAL EVALUATION

The page-clipping annotation explained in the previous section has been adopted in commercially available software products, and has been used in a number of real applications for Web document adaptation. Moreover, the above-mentioned annotation generation tools have been provided as toolkits for the software products: the example-based annotation tool for the transcoding proxy [26] and the selection-based annotation tool for portal server development [6].

4.1 Single-Node Pointing Expression

External annotations generated by these two tools use a type of addressing expression that points to at most one particular node, and points to nothing if that node is not found. We call expressions of this type single-node pointing expressions. They are suitable for automatic generation of addressing expressions, because they can be generated solely on the basis of focal nodes without human intervention.

Although the XPath language [27] provides thirteen axes (such as ancestor, descendant, following, and preceding) for specifying the direction of node-set selection in a location step, only two of them, the child and descendant axes, can point to every element from the document root element when a single kind of axis is used for every location step together with a position-number predicate. Therefore, there are only two kinds of single-node pointing expressions that use only one axis. One points to an element of the target document using a sequence of child-position location steps (ChildPosSeq). The other points to an element by means of a descendant position (DescendantPos).

The ChildPosSeq expression simply follows the hierarchy of the DOM tree from the root to a target element, and points to at most one element by a sequence of child positions (e.g., /html[1]/body[1]/table[2]/tbody[1]/tr[1]/td[2]). The DescendantPos expression, on the other hand, gives the position of an element, in document order, among all the descendants of the root node that have the same tag name, and points to at most one element by a descendant position (e.g., /descendant::table[8]).

The two annotation tools mentioned above use the ChildPosSeq expression; the DescendantPos expression was not used for the generation of addressing expressions. In the remainder of this section, we present an empirical evaluation of the robustness of XPath expressions, in order to draw practical implications for the reliable use of external annotation.

4.2 Evaluation Method and Results

Table 2 shows the basic data of the observed HTML pages. Pages A and B are a corporate top page and a product page of the same company. Page C is the top page of a news media company, while page D belongs to a software company. These pages were saved each day over a period of about one year and three months, and 540 pages for each URI were collected as samples for this investigation.

Table 2: Basic data of the observed HTML pages

Page  URI                     Number of nodes per page   Depth of document tree
                              [ave. (max, min)]          [ave. (max, min)]
A     www.ibm.com/            393 (441, 348)             21 (21, 20)
B     www.ibm.com/products/   709 (758, 623)             27 (30, 20)
C     public.wsj.com/         952 (1333, 433)            22 (24, 21)
D     java.sun.com/           909 (1311, 325)            22 (29, 15)

The number of sample pages is 540 for each reference page.

Figure 8: Process of node ID insertion. (a) Inserting IDs: reference page, reference page with IDs. (b) Differencing and patching IDs: sample pages (1 .. N), difference files (1 .. N), sample pages with IDs (1 .. N).

Figure 9: Number of nodes with ID attributes during the observation period (x-axis: number of days; y-axis: number of nodes with ID attribute; one curve each for pages A, B, C, and D)
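The two single-node pointing expressions of Section 4.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the annotation tools: the helper names child_pos_seq and descendant_pos and the sample page are hypothetical, and the standard xml.etree.ElementTree module stands in for a DOM tree produced by an HTML parser.

```python
import xml.etree.ElementTree as ET


def child_pos_seq(root, target):
    """Build a ChildPosSeq expression, e.g. /html[1]/body[1]/table[2]."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        if parent is None:
            siblings = [root]  # the document element has no siblings
        else:
            siblings = [c for c in parent if c.tag == node.tag]
        # 1-based position among siblings with the same tag name
        position = next(i for i, s in enumerate(siblings, 1) if s is node)
        steps.append(f"{node.tag}[{position}]")
        node = parent
    return "/" + "/".join(reversed(steps))


def descendant_pos(root, target):
    """Build a DescendantPos expression, e.g. /descendant::table[8]."""
    # root.iter(tag) visits same-tag elements in document order.
    for position, node in enumerate(root.iter(target.tag), 1):
        if node is target:
            return f"/descendant::{target.tag}[{position}]"
    raise ValueError("target element is not part of the tree")


# Hypothetical sample page: two tables, three cells in total.
PAGE = (
    "<html><body>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "<table><tr><td>c</td></tr></table>"
    "</body></html>"
)
root = ET.fromstring(PAGE)
target = list(root.iter("td"))[2]  # the cell containing "c"

print(child_pos_seq(root, target))
print(descendant_pos(root, target))
```

In this sketch the ChildPosSeq expression depends only on same-name siblings along the path from the root, whereas the DescendantPos expression depends on every same-name element that precedes the target in document order.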
In order to make sure that an XPath expression actually points to the same node as in a reference page, it is necessary to know the node-to-node mapping between the reference page and the other sample pages with the same URI that were collected during the observation period. For the purpose of this empirical study, an ID attribute was added to every DOM node of the reference and sample pages. Note that most existing HTML pages are not well-formed, and their nodes therefore cannot be reliably pointed to by XPath expressions. It is assumed here that both the reference page and the sample pages are parsed by an HTML parser in advance, and converted to DOM trees before the node ID insertion process.

Figure 8 illustrates the process of node ID insertion, which consists of the following two steps. The first step is to add a unique identifier as an attribute value (e.g., uid="N27") to every node in the reference page [Figure 8 (a)]. Note that comment and text nodes were excluded from the ID insertion, because an attribute cannot be added to those nodes. The second step consists of differencing and patching [Figure 8 (b)]. The DOM-tree difference was calculated from the changes in each sample page as compared with the reference page with the same URL. Each difference file consists of a sequence of edit operations that transform the reference page into the corresponding sample page. The difference files are then applied to the reference page with unique identifiers. Finally, as a result of the patching, we obtain a set of sample pages with unique identifiers. As a differencing and merging tool for XML documents, we used the 3DM tool [21].

It is straightforward to insert ID attributes into all the DOM nodes in the reference pages. However, due to the changes in Web pages over time, some nodes will be inserted into or deleted from the documents. The newly inserted nodes do not have any ID attribute, because they cannot be mapped from any node in the reference page. In addition, like most tree differencing and merging tools, the 3DM algorithm regards an attribute-value change as a deletion of the modified node followed by an insertion of the same node with the updated attribute value. This means that the ID attribute disappears when a node is updated as a result of an attribute-value change.

Figure 9 shows the number of nodes with ID attributes for each sample page along the 540-day observation period. For pages A and B, there is a sudden decrease in the number of nodes with IDs just after the 50th day due to small style changes. In addition, since the site design for pages C and D was changed, the number of nodes with IDs decreased suddenly in page C after the 232nd day and in page D after the 367th day. The number in page D decreased temporarily for about 10 days from the 272nd day. This is because the site design was temporarily changed for the annual developer conference held by that company (the 2002 JavaOne Conference).

The number of nodes with IDs decreases as a result of the deletion, insertion, and update of DOM nodes. Note here that the differencing algorithm was always applied to obtain node-to-node mappings from a reference page to a sample page. Therefore, the difference from the reference page does not necessarily increase monotonically, but may be reduced later if the page is modified again to be closer to the reference page. This is why the number of nodes with IDs increases at some points in the observation period.

Every DOM node with a unique identifier in a reference page was regarded as a reference node. For each reference node, we created ChildPosSeq and DescendantPos expressions that point to the corresponding reference node. This set of XPath expressions constituted a test set [Figure 10 (a)]. Each expression in the test set was then applied to the sample pages with the same URI [Figure 10 (b)]. If an XPath expression actually points to a node with the same ID as the reference node, the expression was regarded as correct in the sample page. Note that an XPath expression may point to multiple nodes in a target document. However, since this evaluation only deals with single-node pointing expressions, a correct expression must point to exactly one node with the same ID in a sample page, and must not point to any other node in the sample page.

Figure 10: Process of testing XPath expressions. (a) A test set of XPath expressions is created from the reference page with IDs. (b) The expressions are applied to the sample pages with IDs (1 … (k) … N).

Figure 11 shows the percentages of nodes that are correctly pointed to by each type of expression. The ChildPosSeq expression always scores a higher percentage than the DescendantPos expression. In the next section, we further investigate the robustness of the two types of expressions.

Figure 11: Percentage of correctly pointed nodes for each XPath expression (y-axis: percentage of correct pointing, 0 to 100; curves: ChildPosSeq, DescendantPos)

5. DISCUSSION

A correctly pointing expression points to a target node that is identical with the reference target node, as identified by the ID-attribute value, and does not point to any node other than the reference target. Here we call this correct pointing exact pointing. When an XPath expression does not correctly point to a target node, the ways of incorrect pointing can be categorized into three types: nonexistent, inclusive, and exclusive. Nonexistent pointing is when an XPath expression points to nothing. Inclusive pointing is when an expression points to a node set that includes not only the target node but also nodes other than the target. Exclusive pointing is when an expression points to a node or a set of nodes that does not include the target node at all.

Figure 12 divides the nodes up by pointing status. The total number of tested nodes was more than a million (1,280,880), counting all the HTML elements included in the sample pages throughout the entire observation period. The IDValueMatch expressions rely on the ID-attribute value of the target node. An example of the IDValueMatch expression is //*[@uid='N35'], where "uid" is the name of the ID attribute we created in the ID-insertion process (Figure 8). The IDValueMatch expression corresponds to a situation in which every node of the document at hand is provided with a unique identifier. Although the attribute name "uid" might be accidentally duplicated in some sample pages, no duplication was found within the pages examined for this study.

Figure 12: Number of nodes with detailed distinction of pointing status (x-axis: number of nodes, 0 to 1,200,000; bars: IDValueMatch, ChildPosSeq, DescendantPos; legend: Exact, Exclusive, Nonexistent)

Besides the numbers of exact (correct) pointing, one of the notable features in Figure 12 is the large number of exclusive pointing errors for the DescendantPos expression. An exclusive pointing error means that an XPath expression points to a node or a set of nodes that does not include the target node at all.

If we consider only exact pointing, an exclusive pointing error is simply one more case of incorrect pointing, and no more serious than the others. However, when we use XPath expressions for external annotations, we need to think further about the actual influence of such incorrect pointing on the behavior of runtime engines, such as XSLT processors for XSLT stylesheets. Therefore, it is also important to investigate the robustness of XPath expressions taking account of application scenarios.

6. CONCLUDING REMARKS

In this paper, we presented variations in annotation tools, and explained the two types of tools that generate transformational annotation for Web document clipping. Since transformational annotations are descriptions of the ways of modifying the document at hand, the semantics of the document adaptation can easily be indicated through the annotator's demonstration or editing actions to obtain the desired result of adaptation. Although the example-based annotation tool is the most sophisticated approach to creating transformational annotation, it may suffer from difficulties in inferring the annotator's intention behind the editing process. On the other hand, the selection-based annotation is a simpler approach, and is limited in its capabilities of annotation generation. However, it is noteworthy that the simplicity was a real advantage in the development of the clipping portlet, because it was not practical to create sophisticated clipping annotations for the several thousand pages to be aggregated into a supplier portal.

Finally, the empirical study presented in this paper is tightly bound to the specific HTML pages we investigated, and we know that our sample sizes are too small to have statistical validity. Therefore, we are not claiming that the results can easily be generalized to all other kinds of HTML documents. However, this empirical study is an important step towards establishing hypotheses regarding phenomena that may hinder the practical use of external annotations or metadata that exploit XPath expressions.

7. REFERENCES

[1] Abe, M. and Hori, M.: A visual approach to authoring XPath expressions. Proceedings of Extreme Markup Languages 2001, pp. 1–14, Montréal, Canada (2001).
[2] Asakawa, C. and Takagi, H.: Transcoding system for non-visual Web access (2): annotation-based transcoding. Sixteenth International Conference on Technologies and Persons with Disabilities (CSUN2001) (2001).
[3] Brush, A. J., Bargeron, D., Gupta, A., and Cadiz, J. J.: Robust annotation positioning in digital documents. Proceedings of the 2001 ACM Conference on Human Factors in Computing Systems (CHI 2001), pp. 285–292, Seattle, Washington (2001).
[4] Cadiz, J. J., Gupta, A., and Grudin, J.: Using Web annotations for asynchronous collaboration around documents. Proceedings of ACM 2000 Conference on Computer Supported Cooperative Work (CSCW 2000), pp. 309–318, Philadelphia, PA (2000).
[5] Denoue, L. and Vignollet, L.: An annotation tool for Web browsers and its applications to information retrieval. Proceedings of the 6th Conference on Content-Based Multimedia Information Access (RIAO 2000), Paris, France (2000).
[6] DeWitt, S.: Basic Web Clipping Using WebSphere Portal Version 4.1. IBM WebSphere Developer Domain, http://www7b.software.ibm.com/wsdd/library/techarticles/0206 dewitt/dewitt.html (2002).
[7] Dublin Core Metadata Element Set, Version 1.1: Reference Description. Dublin Core Metadata Initiative, Recommendation, http://dublincore.org/documents/dces/ (1999).
[8] Device Independence Principles. W3C Working Draft, http://www.w3.org/TR/di-princ/ (2001).
[9] Document Object Model (DOM) Level 1 Specification Version 1.0. W3C Recommendation, http://www.w3.org/TR/REC-DOM-Level-1/ (1998).
[10] Erdmann, M., Maedche, A., Schnurr, H.-P., and Staab, S.: From manual to semi-automatic semantic annotation: about ontology-based text annotation tools. Proceedings of the COLING 2000 Workshop on Semantic Annotation and Intelligent Content, Luxembourg (2000).
[11] Handschuh, S. and Staab, S.: Authoring and annotation of Web pages in CREAM. Proceedings of the 11th International World Wide Web Conference, pp. 462–473, Honolulu, Hawaii (2002).
[12] Heflin, J. and Hendler, J.: Semantic interoperability on the Web. Proceedings of Extreme Markup Languages 2000, pp. 111–120 (2000).
[13] Hori, M., Kondo, G., Ono, K., Hirose, S., and Singhal, S.: Annotation-based Web content transcoding. Proceedings of the 9th International World Wide Web Conference, pp. 197–211, Amsterdam, Netherlands (2000).
[14] Hori, M., Ono, K., Koyanagi, T., and Abe, M.: Annotation by transformation for the automatic generation of content customization metadata. In F. Mattern and M. Naghshineh (Eds.), Pervasive Computing, First International Conference, Pervasive 2002, Lecture Notes in Computer Science 2414, pp. 267–281, Zurich, Switzerland (2002).
[15] Hori, M.: Semantic annotation for Web content adaptation. In D. Fensel, J. Hendler, H. Lieberman, and W. Wahlster (Eds.), Spinning the Semantic Web, pp. 542–573, MIT Press, Boston, MA (2002).
[16] Hori, M., Abe, M., and Ono, K.: Extensible framework of authoring tools for Web document annotation. Proceedings of International Workshop on Semantic Web Foundations and Application Technologies (SWFAT), pp. 1–8, Nara, Japan (2003).
[17] Kahan, J. and Koivunen, M.-R.: Annotea: an open RDF infrastructure for shared Web annotations. Proceedings of the 10th International World Wide Web Conference, pp. 623–632, Hong Kong (2001).
[18] Koyanagi, T., Ono, K., and Hori, M.: Demonstrational Interface for XSLT Stylesheet Generation. Markup Languages: Theory & Practice, 2(2): 133–152 (2001).
[19] Lassila, O.: Web metadata: a matter of semantics. IEEE Internet Computing, 2(4): 30–37 (1998).
[20] Lieberman, H. (Ed.): Your Wish is My Command: Programming by example. Morgan Kaufmann Publishers, San Francisco (2001).
[21] Lindholm, T.: A 3-way merging algorithm for synchronizing ordered trees – The 3DM merging and differencing tool for XML. Master Thesis, Department of Computer Science, Helsinki University of Technology (2001).
[22] Mea, V. D., Beltrami, C. A., Roberto, V., and Brunato, D.: HTML generation and semantic markup for telepathology. Proceedings of the 5th International World Wide Web Conference, pp. 1085–1094, Paris, France (1996).
[23] Nagao, K., Shirai, Y., and Squire, K.: Semantic annotation and transcoding: making Web content more accessible. IEEE Multimedia, 8(2): 69–81 (2001).
[24] Phelps, T. A. and Wilensky, R.: Robust intra-document locations. Proceedings of the 9th International World Wide Web Conference, pp. 105–118, Amsterdam, Netherlands (2000).
[25] Rousseau, J. F., Macias, A. G., de Lima, J. V., and Duda, A.: User adaptable multimedia presentations for the World Wide Web. Proceedings of the 8th International World Wide Web Conference, pp. 195–212, Toronto, Canada (1999).
[26] Spinks, R., Topol, B., Seekamp, C., and Ims, S.: Document clipping with annotation. IBM developerWorks, http://www.ibm.com/developerworks/ibm/library/ibm-clip/ (2001).
[27] XML Path Language (XPath) Version 1.0. W3C Recommendation, http://www.w3.org/TR/xpath (1999).
[28] XSL Transformations (XSLT) Version 1.0. W3C Recommendation, http://www.w3.org/TR/xslt (1999).
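The four pointing statuses defined in Section 5 (exact, nonexistent, inclusive, and exclusive) can be expressed as a small classification function. The sketch below is an illustration under assumptions, not the evaluation code used in the study: the function name pointing_status and the representation of an XPath result as a list of uid values (with None for a node that carries no uid) are hypothetical.

```python
def pointing_status(selected_ids, target_id):
    """Classify the result of applying one XPath expression to a sample page.

    selected_ids: uid values of the nodes the expression selected
                  (None for a selected node without a uid attribute).
    target_id:    uid of the reference node the expression was created for.
    """
    selected = list(selected_ids)
    if not selected:
        return "nonexistent"   # the expression points to nothing
    if target_id not in selected:
        return "exclusive"     # only nodes other than the target
    if len(selected) == 1:
        return "exact"         # the target and nothing else
    return "inclusive"         # the target plus other nodes


print(pointing_status(["N35"], "N35"))
print(pointing_status([], "N35"))
print(pointing_status(["N35", "N36"], "N35"))
print(pointing_status([None], "N35"))
```

A node inserted after the reference page was saved has no uid, which is why the sketch treats None as a possible entry: an expression that lands on such a node counts as exclusive pointing.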