=Paper= {{Paper |id=None |storemode=property |title=Deriving Dynamics of Web Pages: A Survey |pdfUrl=https://ceur-ws.org/Vol-707/TWAW2011-paper4.pdf |volume=Vol-707 |dblpUrl=https://dblp.org/rec/conf/www/OitaS11 }} ==Deriving Dynamics of Web Pages: A Survey== https://ceur-ws.org/Vol-707/TWAW2011-paper4.pdf
                  Deriving Dynamics of Web Pages: A Survey*

Marilena Oita
INRIA Saclay – Île-de-France
& Télécom ParisTech; CNRS LTCI
Paris, France
marilena.oita@telecom-paristech.fr

Pierre Senellart
Institut Télécom
Télécom ParisTech; CNRS LTCI
Paris, France
pierre.senellart@telecom-paristech.fr


ABSTRACT
The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up to date with the current version of a Web page, all the more so since not all apparent changes are significant ones. We review major approaches to change detection in Web pages and to the extraction of temporal properties (especially timestamps) of Web pages. We focus our attention on techniques and systems proposed in the last ten years, and we analyze them to gain insight into the practical solutions and best practices available. We aim at providing an analytical view of the range of methods that can be used, distinguishing them on several dimensions: their static or dynamic nature, the modeling of Web pages, and, for dynamic methods relying on comparison of successive versions of a page, the similarity metrics used. We advocate for more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services

General Terms
Algorithms, Experimentation, Measurement

Keywords
Change monitoring, Web archiving, Timestamping

1.    INTRODUCTION
   The World Wide Web challenges our capacity to develop tools that can keep track of the huge amount of information that is modified at a rapid rate. Web archiving crawlers [25], especially, need the ability to detect change in Web content and to infer when a Web page last changed. This ability is fundamental for maintaining the coherence of the crawl, adjusting its refresh rate, versioning, and allowing a user to retrieve meaningful temporal data. The understanding of the dynamics of Web pages, that is, how fast Web content changes and what the nature of these changes is, its implications on the structure of the Web, and the correlation with the topic of pages, are also popular subjects in the research literature [4].
   In addition to being of paramount importance to Web archiving, the subject of change detection is of interest in various applications and domains, such as large-scale information monitoring and delivery systems [18, 15, 24, 17, 23] or services¹, Web cache improvement [11], version configuration and management of Web archives [30], active databases [18], and the servicing of continuous queries [1]. Research has focused on finding novel techniques for comparing snapshots of Web pages (a reference Web page and its updated version) in order to detect change and estimate its frequency. Change can however be detected at various levels: there are various aspects of dynamics which must be considered when studying how Web content changes and evolves.
   The majority of works have seen the change detection problem from a document-centric perspective, as opposed to an object-centric one. By object or entity we mean here any Web content, part of a Web page, that represents meaningful information per se: an image, news article, blog post, comment, etc. Comparatively little effort has been put into distinguishing relevant changes from those that might occur because of the dynamic template of a Web page (ads, active layout, etc.), i.e., its boilerplate [21].
   We study in this article some of the strategies that have been established in different settings, with the aim of providing an overview of the existing techniques used to derive temporal properties of Web pages.
   There is a large body of work on the related problem of change detection in XML documents, particularly for purposes of data integration and update management in XML-centric databases. However, the solutions developed for XML documents cannot be applied to HTML pages without serious revisions. The model assumptions made for XML do not really hold for HTML. Indeed, the HTML and XML formats have a key difference: while an XML page defines the nature of the content by its meta tags, HTML tags define

*This research was funded by the European Research Council grant Webdam FP7-ICT-226513.
¹Examples of these include http://timelyweb.en.softonic.com/, http://www.urlywarning.net/, http://www.changealarm.com/, http://www.changedetect.com/.

Copyright 2011 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.
TWAW 2011 March 28, 2011, Hyderabad, India




mainly presentational aspects of the content within. In addition to the challenges that exist in XML documents, Web pages add some others: their lack of formal semantics, a fuzziness regarding the formatting, the embedding of multimedia and scripts, etc. Separate approaches commonly need to be adopted, and the research done on XML documents is beyond the scope of our paper, although we mention some works on XML [37, 14] that have particular relevance to Web page change detection. We refer the reader to [13] for a survey of XML change detection algorithms.
   We focus in this survey on deriving the dynamics of HTML documents.
   Change detection mechanisms can either be static, estimating the date of last modification of content from the Web page itself (its code, semantics, or neighbors), or dynamic, comparing successive versions of a Web page. The structure of this article reflects these dimensions. In Section 2, we present static approaches to timestamping Web pages, while Section 3 introduces dynamic methods. We then analyze in Section 4 the different models of a Web page used by existing techniques for comparing successive versions. Similarity metrics used in dynamic methods are independently investigated in Section 5. We briefly describe statistical modeling approaches to estimate change frequency in Section 6. We conclude with a discussion of some remaining open questions.

2.    STATIC APPROACHES: TIMESTAMPING
   This section deals with methods for inferring temporal properties of a Web page in a static manner, as opposed to the commonly used dynamic computation of the difference between successive versions of a given Web page. The goal here is to infer the creation or last modification date of a Web page or, possibly, of some parts of it. We study sources of data that can be useful for that purpose: metadata, the content of the Web page itself, or its graph neighborhood. The canonical way of timestamping a Web page is to use the Last-Modified HTTP header. Unfortunately, studies have shown this approach is not reliable in general [12]. We describe next why this happens in practice, and other techniques for timestamping Web pages.

2.1    HTTP metadata
   HTTP/1.1, the main protocol used by Web clients and servers to exchange information, offers several features of interest for timestamping, the foremost of which are the ETag and Last-Modified HTTP response headers. Entity tags (or ETags) are unique identifiers for a given version of a particular document. They are supposed to change if and only if the document itself changes. Servers can return an ETag with any response, and clients can use the If-Match and If-None-Match HTTP request headers to condition the retrieval of the document on a change in the ETag value, thus avoiding the retrieval of already known content. If-Modified-Since and If-Unmodified-Since provide conditional downloading features in a similar way as for ETags. Even when conditional downloading is not possible, ETags and HTTP timestamps can be retrieved by a Web client without downloading a whole Web page by making use of the HEAD HTTP method. The problem is that while this information is generally provided and is very reliable for static content (e.g., static HTML pages or PDF), it is most of the time missing or changed at every request (the timestamp given is that of the request, not of the content change) when the content is dynamic (generated by content management systems, etc.). Some CMSs do return correct HTTP timestamps, such as MediaWiki², but they seem to be a minority.
   In [12], Clausen presents an experimental study of the reliability of ETags and HTTP timestamps on a collection of a few million Danish Web pages. He finds that the best strategy for avoiding useless downloads of versions of Web pages already available in a Web archive is to always download when the ETag header is missing, and otherwise to download only if the Last-Modified header indicates change. This rather counterintuitive result yielded in this experimental study an almost perfect prediction of change, and a 63% accuracy in predicting non-change. Given that the majority of Web servers run some version of the open-source Apache³ HTTP server [26], it would be interesting to see whether this strategy is correlated with some inherent behavior of this software. Furthermore, repeating this experiment on a larger scale and with a more recent set of Web pages would be of interest.
   HTTP also provides the Cache-Control and Expires response headers. This information is often given, but with a zero or very low expiration delay, which means that nothing interesting can be derived from it. In some specific and controlled environments (e.g., intranets), it might still be useful to look at these two pieces of information to estimate the refresh rate of a Web page.

2.2    Timestamps in Web content
   Content management systems, as well as Web authors, often provide in the content of a Web page some human-readable information about its last date of modification. This can be a global timestamp (for instance, preceded by a “Last modified” string, in the footer of a Web page) or a set of timestamps for individual items in the page, such as news stories, blog posts, or comments. In the latter case, the global timestamp might be computed as the maximum of the set of individual timestamps. It is actually quite easy to extract and recognize such information, with keyword selection (last, modification, date, etc.) or with entity recognizers for dates (built out of simple regular expressions). However, this timestamp is often quite informal and partial: there is sometimes no time indication, and most of the time no timezone. To the best of our knowledge, no formal study of the precision reached by extracting timestamps from Web content has been carried out.

2.3    Semantic temporal associations
   In addition to these timestamps provided to humans, documents on the Web may include additional semantic timestamps meant for machines. No mechanism for this exists in HTML per se, but the HTML specification [36] allows for arbitrary metadata in the form of <meta> tags, one particular profile of such metadata being Dublin Core⁴, whose modified term indicates the date of last modification of a Web page. Both content management systems and Web authors occasionally use this possibility. Web feeds (in RSS or Atom formats) also have semantic timestamps, which are quite reliable, since they are essential to the working of applications that exploit them, such as feed readers. In some cases, external semantic content can be used for dating an HTML Web

²http://www.mediawiki.org/
³http://www.apache.org/
⁴http://dublincore.org/documents/dcmi-terms/
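The download strategy reported by Clausen [12] in Section 2.1 can be sketched as a small decision function; the function name and the dictionary representation of HTTP headers are our own illustration, not taken from the paper:

```python
def should_download(archived_headers, current_headers):
    """Decide whether to re-download a page already in the archive,
    following the strategy reported in [12]: always download when the
    server sends no ETag; otherwise download only if the Last-Modified
    header indicates a change."""
    if current_headers.get("ETag") is None:
        return True
    return archived_headers.get("Last-Modified") != current_headers.get("Last-Modified")

# Example: the server supplies an ETag and an unchanged Last-Modified,
# so the archived version is considered still current.
archived = {"ETag": '"abc"', "Last-Modified": "Tue, 15 Mar 2011 10:00:00 GMT"}
current = {"ETag": '"abc"', "Last-Modified": "Tue, 15 Mar 2011 10:00:00 GMT"}
print(should_download(archived, current))  # False: skip the download
```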



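The keyword-and-regular-expression extraction described in Section 2.2 can be sketched as follows; the keyword list and the date patterns are illustrative assumptions, not taken from any of the surveyed systems, and a real recognizer would cover many more date formats:

```python
import re

# Illustrative pattern: a modification keyword followed, within a few
# characters, by a simple numeric date (ISO or day/month/year).
TIMESTAMP_RE = re.compile(
    r"(?:last\s+modified|updated|modification\s+date)\D{0,10}"
    r"(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})",
    re.IGNORECASE,
)

def extract_timestamps(text):
    """Return all date strings that follow a modification keyword."""
    return TIMESTAMP_RE.findall(text)

footer = "Page last modified: 2011-03-28. Comments updated 28/03/2011."
print(extract_timestamps(footer))  # ['2011-03-28', '28/03/2011']
```

As noted above, such timestamps are informal and partial: this sketch recovers no time of day and no timezone.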

page: for instance, an RSS feed containing blog entries can be mapped to the corresponding Web page, in order to date individual items [29]. Another case is that of sitemaps [35]. Sitemaps are files that can be provided by the owner of a Web site to describe its organization, so as to improve its indexing by Web search engines. Sitemaps allow for both timestamps and change rate indications (hourly, monthly, etc.), but these features are not often used. Very few content management systems produce all of this, although it would be the ideal case: the download of a single file would suffice to get all timestamping information about the whole Web site.

2.4      Using the neighborhood
   It is possible to use the graph structure of the Web to help timestamp Web pages: [28] uses the neighboring pages of a Web page to estimate its timestamp. When no source of reliable timestamps is found for a given page using one of the techniques described above, its timestamp is set to some form of average of the timestamps of the pages pointed to by, and pointing to, this page. The inherent assumption is that pages linked together tend to have similar update patterns. The precision is not very high, but better than nothing when no other information is available.

3.     DYNAMIC METHODS
   When static techniques do not give adequate results, there is still the possibility of comparing a Web page with its previous version in order to determine (sometimes in a rough way) an equivalent of Last-Modified. The timestamp that gives the last modification of a Web page can then be inferred as the date when change has been detected.
   Nevertheless, for a precise estimation, a just-in-time crawl of versions is needed. In reality, the frequency of change is quite difficult to estimate because Web pages have different patterns of change. In general, many factors determine variations in the frequency of change for a given Web page: the CMS, the subject, the time of the year, even the time of the day, etc.
   Estimating Web pages’ frequency of change is the subject of many studies [16, 2, 27, 10, 19]. Their results are however heavily dependent on the change detection technique that they have used.
   There are two parameters that influence the process of determining the dynamics:

      1. the frequency of change, which is not known in advance; if we do not crawl the Web pages on time, we miss versions and the timestamp detected will then be imprecise;

      2. the change detection technique, which heavily relies on the similarity metrics, on the model (which can capture more or fewer types of changes), and on the filtering of dynamic elements that influence the frequency without being truly relevant (and which occur quite often in Web pages because of AJAX applications or advertisements).

   A method of filtering irrelevant content is to know what is important rather than trying to filter what is unimportant. If we knew in advance the frequency of crawl, then we would know when change occurs and could therefore set timestamps for new versions. This is unfortunately not the case, but we need however to crawl frequently enough (better too frequently than not frequently enough). Once we have these versions, we can detect whether any change happened in the interval of time between two crawls. Based on this, timestamps can be derived with good approximation.
   The majority of works consider that they have access to the versions, and they focus on detecting change efficiently. Few make a semantic distinction between changes by disregarding insignificant ones. Next, we present some general notions about changes: what kinds of changes can occur in Web pages, which have been identified in our studied approaches and which have not, and finally we give an insight into how change is actually represented.

3.1    Types of changes
   We summarize here the types of changes detected by different approaches. There are however other, more sophisticated types that are sometimes mentioned, but not treated. For instance, behavioral changes are mentioned in [38]; they occur in active HTML elements like scripts, embedded applications, and multimedia. These new forms of Web content have a big impact on the Web today, but they require a more complex modeling.
   All considered approaches detect changes in content.
   Works that model the HTML page as a tree, including page digest encoding, usually also detect structural and attribute changes. However, differences exist in the coverage of cases for these types of changes. For example, MH-Diff [9] also detects move and copy structural changes, which is an improvement over the traditional detection of insert, delete, and update. Structural (or layout) changes occur when the position of elements in the page is modified. Attribute (or presentation) changes are related to the representation of information, for instance changes in the font, colors, or captions. For capturing structural and attribute changes, the model has to be aware of their existence; this implies a more complex model, which generally influences performance in a negative manner. Unlike flat-file models of Web pages, the output of content change detection in hierarchical models is more meaningful: the type of node in which the content change occurred can also be identified.
   There are also type changes, which are modifications that come about when the HTML tags change: e.g., a p tag which becomes a div. Type changes can be detected by [31], which uses the page digest encoding that provides a mechanism for locating nodes of a particular type (see further).
   Semantic types of change capture the meaning of content that has changed. They are defined in SCD [23], a pioneering work in this direction.
   Changes are sometimes captured in a quantitative manner rather than in a qualitative one. In contrast with the qualitative way, where the change is described (in a delta file) or visualized in a comparative manner, quantitative approaches estimate the amount of change of a specific type. More specifically, all approaches that use the similarity formula defined in CMW [17] do not reconstruct the complete sequence of changes, but give a numerical value for it. In this case, supposing a threshold of change, we can determine whether a page has changed or not, which actually represents more of an oracle response, still useful in the majority of applications.

3.2    The representation of change
  There are various ways to present the difference between




two documents. Changes are usually stored in a physical structure generically called a delta file or delta tree. The format of storing the change for RMS [38] consists in a specialized set of arrays that capture the relationships among the nodes and the changes that occur both in structure and content. Systems for monitoring change like [24, 18] typically have a user interface and present changes in a graphical way. HTMLdiff merges the input Web page versions into one document that summarizes all the common parts as well as the changed ones. The advantage is that the common parts are displayed just once, but, on the other hand, the resulting merged HTML can be syntactically or semantically incorrect. Another choice linked to change presentation is to display only the differences and omit the common parts of the two Web pages. When the documents have a lot of data in common, presenting only the differences could be better, with the drawback that the context is missing. The last approach is to present the differences between the old and new versions side by side.
   These presentation modes are used in combination, rather than being the unique choice for a given system. For example, [24, 18] present the results of the change monitoring service using a hybrid approach: presentation modes are combined depending on the type of change that is presented.

4.    HTML DOCUMENT MODELS
   This section contains an overview of the models that are considered in the quest of detecting changes in Web documents. The modeling step is a key one, as it determines the elements on which the comparison algorithms operate.
   We first discuss the “naïve” approach, that is, considering Web pages as flat files (strings); then we describe tree models, a popular choice in the literature. We also explore some approaches that are based on tree models, but which essentially transform the two versions to be compared into a bipartite graph on which specific algorithms are applied. Finally, we present the Page Digest design of Web pages, a manner of encoding that clearly separates structural elements of Web documents from their content, while remaining highly compact.

4.1    Flat files
   Some early change detection systems model Web pages as flat files [15]. As these models take into account neither the hierarchical structure of HTML documents nor the characteristics of the layout, they can detect only content changes – and this without making any semantic distinction in the content.
   Some works [34] try to first filter out irrelevant content by using heuristics on the type of content and regular expressions. After this basic filtering, the Web page content is directly hashed and compared between versions. Unfortunately, we can never filter all kinds of inconvenient content, especially when its manner of encoding or its type gets more complex.

4.2    Trees
   A natural approach is to represent a Web page as a tree using the DOM model. By taking the hierarchies into account, structural and attribute changes can be detected besides content changes.
   Differences between tree models appear in their ordered characteristics and in the level of granularity at which change is detected: node, branch, subtree or “object”. We further discuss these aspects below.
   First, the modeling of a Web page into a tree requires a preprocessing step of cleaning. This is a significant one because it corrects missing, mismatching, or out-of-order end tags, as well as all other syntactic ill-formedness of the HTML document. A tree is constructed first by filtering the “tag soup” HTML into an XML document and second, by manipulating the result in order to obtain a workable tree structure using implementations of the DOM standard (which usually employ XSLT and XPath). Sometimes an initial pruning of elements is also done: [22] filters out scripts, applets, embedded objects, and comments.
   Many works do not specify how they realize the cleaning, so either they assume it to be done in advance, or they solve this issue by using an HTML cleaning tool. HTML Tidy⁵ is well suited for this purpose and mentioned in [3].

⁵http://tidy.sourceforge.net/

   There are however cases [23] when the tree model does not need to be cleaned: as the semantics of tags (value, name) is leveraged in the technique, the structure does not have to be enforced.

Ordered trees.
   The ordered characteristic of trees implies that the order of appearance of nodes is considered in the algorithm and therefore included in the model.
   The RMS algorithm [38] stores the level (depth) of a node, information which is used in the tree traversal and parsing. The SCD [23] algorithm uses branches of ordered trees to detect semantic changes. The notion of branch is used to give the context of a node and is formalized as an ordered multiset, in which the elements are designated by the tag name of an HTML non-leaf node, or by its content (text) if it represents a leaf node. The order is very important in this model because of the data hierarchy considered (e.g., book.author.name.Eminescu vs. a markup hierarchy like div.p.b.PCDATA). In this model, if we change the order, we change the semantics of the hierarchies, or this semantics becomes incoherent.
   The ordered tree model is also used in [20].

Unordered trees.
   The unordered labeled tree model does not consider the order of appearance of elements in the tree as relevant; instead, only the parent-child relationships are captured.
   It is mentioned in MH-Diff [9] that the change detection problem for unordered trees is harder than for ordered ones. Like [17, 9], [22] constructs a weighted bipartite graph from the two trees given as input. In these models, the order has no importance for the final structure on which further processing will be done; therefore this feature is not captured.
   [3] delimits and encodes subtrees; these are generated by analyzing the level of a node in the unordered tree. From the root, when we arrive at a level that is a multiple of 3 (level % 3), a subtree is created; the node at the next level (level % 3 + 1) becomes the local root of the following subtree, and this continues iteratively until the end of the hierarchy. Each subtree is marked with the tag name of its local root and indexed based on its start and end node identifiers. A hashtable finally maps metadata about nodes (like tag name, content, path to the root, attribute–value pairs, etc.).

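The flat-file approach of Section 4.1 can be sketched as follows. This is a minimal illustration, not the actual pipeline of [34]; the filtering heuristics (dropping markup and digits) and the choice of SHA-1 are our own assumptions:

```python
import hashlib
import re

def page_fingerprint(html: str) -> str:
    """Hash a page after stripping content assumed to be volatile."""
    # Hypothetical filtering: remove script/style blocks and all markup.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                  flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    # Drop digits, which often belong to counters, dates, or ads.
    text = re.sub(r"\d+", " ", text)
    # Normalize whitespace and case before hashing.
    text = " ".join(text.split()).lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def has_changed(old_html: str, new_html: str) -> bool:
    # Any difference that survives the filter counts as a change;
    # no semantic distinction is made, as noted above.
    return page_fingerprint(old_html) != page_fingerprint(new_html)
```

As the text notes, such filters are never exhaustive: any volatile content that slips through still triggers a spurious change.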

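The level-based subtree encoding of [3] might be sketched as follows. This is a loose reading under our own simplifying assumptions: a node is a (tag, text, children) triple, and a new subtree is started every three levels:

```python
import hashlib

def index_subtrees(node, depth=0, table=None):
    """Cut a tree into subtrees every three levels and hash node metadata.

    The resulting hashtable maps the tag name of each local root to the
    hashes of the metadata of the subtrees rooted there.
    """
    if table is None:
        table = {}
    tag, text, children = node
    if depth % 3 == 0:
        # This node becomes the local root of a new subtree.
        meta = f"{tag}|{text}|{depth}"
        table.setdefault(tag, []).append(
            hashlib.md5(meta.encode("utf-8")).hexdigest())
    for child in children:
        index_subtrees(child, depth + 1, table)
    return table
```

Comparing the tables computed for two versions then localizes a change to the subtrees whose hash values differ.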

                                                                                               TWAW 2011, Hyderabad, India

4.3   Bipartite graph
   The bipartite graph model is derived from unordered trees: it consists of two independent sets (obtained after tree pruning), each corresponding to a version, that are connected by cost edges. As set elements, subtrees are chosen over nodes in [22, 17]. The assumption [17] is that we might be more interested in which subtree the change occurs than in which specific node.
   The cost of an edge represents the cost of the edit script needed to make a model entity (node or subtree) of the first set isomorphic with one of the second set. The similarities between all subtrees of the first tree and all those of the second tree are computed and placed in a cost matrix. Given this matrix, the Hungarian algorithm [6] is used to find in polynomial time a minimum-cost bijection between the two partitions. This algorithm is typically used in linear programming to find the optimal solution to the assignment problem.
   CMW [17], as well as the MH-Diff [9] algorithm, is based on transforming the change detection problem into one of computing a minimum-cost edge cover on a bipartite graph. Optimizations are brought in [22] in comparison with the work in [6], but the general aim and similarity metrics remain the same as in [3].

4.4   Page Digest
   The Page Digest model⁶ has been adopted in [31] and [38], and represents a more compact encoding of data than the HTML and XML formats, while preserving all their advantages. To construct this model, several steps are performed: counting the nodes, enumerating children in a depth-first manner (in order to capture the structure of the document), and encoding and mapping the content of each node – an encoding which preserves the natural order of text in the Web page.
   SDiff [31] is a Web change monitoring application that uses a digest format that also includes tag type and attribute information. RMS [38] also uses this model, although without giving many details. The advantages of the Page Digest over the DOM tree model are enumerated in [31]. We mention minimality and execution performance: the reduction of tag redundancy gives a more compact model, and therefore a faster document traversal. The algorithms developed on this model run in linear time without requiring many heuristics or restrictions, while also capturing a large palette of changes.

⁶ http://www.cc.gatech.edu/projects/disl/PageDigest/

5.    SIMILARITY METRICS
   Similarity metrics are used in dynamic methods of detecting change, in the matching stage of the two versions modeled as described in Section 4.
   For string models, (content) change is identified when the data strings are discovered to be partially dissimilar. For tree models, which have various dimensions, the problem gets more complex.
   The aim is to identify model elements that are essentially the same, but which have been affected by change. Model elements that are identical are pruned because they have basically not changed between versions; also, totally dissimilar model elements do not represent instances of the object that has evolved, so they will not be included in further processing steps. Essentially, only approximately similar model elements will be further studied. A typical matching is based on comparisons of the attribute values of the model elements. If they have the same properties, then they are similar.
   We will next describe the types of similarity metrics used for change detection in different settings or applications, for the studied cases.

5.1    String matching techniques
   Approaches that compare two Web pages modeled as flat files (i.e., strings) rely on hash-based methods, edit distance metrics, or techniques based on the longest common subsequence. We do not exhaustively cover all the possibilities, but rather present some of those that we have encountered in the analyzed works.

Naïve: Jaccard.
   One simple technique for change detection in text is to count the number of words that are common (regardless of their position) to two string sequences and to divide the result by the number of distinct words. Another possibility is to divide by the length of the first string, as done in [30].

Hash-based methods.
   In this category, we mention [8, 20, 3]. The method of [8] uses shingling, which is a flexible method of detecting changes between two strings. It is usually referred to as w-shingling, where w denotes the number of tokens in each shingle in the set. A shingle is a contiguous subsequence (w-gram) of the reference string. For the two strings to be compared, their shingles are hashed; if the strings have many shingle values in common, then they are similar.
   Another method is to use signatures. A signature of a string is a function of its hashed value. When the model is hierarchical, the signature of a node is computed based on its terminal path. For a formatting (or non-leaf) node, the signature is the sum of the signatures of its children, down to the leaf nodes, where changes are actually detected. In [20], only nodes that have signatures different from those in the original version will be compared. Change detection algorithms that employ signatures have the disadvantage that false negatives are possible: a change exists, but a different signature for it does not. This obviously depends on the careful choice of the hashing space, and eventually on the application: whether or not it can tolerate false results.
   Another hash-based approach is presented in [3], where change is detected at the atomic element level using a direct hash.

Longest common subsequence.
   A diff is a file comparison program that outputs the differences between two files. Diff tools are based on computing the longest common subsequence between two strings [32]. For instance, HTMLDiff uses the GNU diff utility adapted for HTML page change detection. This program treats Web pages as strings and, after processing, highlights the changes directly in a merged document; as mentioned in [15], HTMLDiff can consume significant memory and computation resources, and this might have an influence on the scalability of the tool. Tree models of Web pages also use diff techniques, not at the HTML page level but at the subtree (or node) content level, which has the advantage of making the computation less complex.

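To make the matching of Section 4.3 concrete, here is a minimum-cost bijection computed over an invented 3×3 cost matrix. For brevity we enumerate all bijections, which is O(n!); the systems above obtain the same result in polynomial time with the Hungarian algorithm:

```python
from itertools import permutations

def min_cost_bijection(cost):
    """Return the assignment minimizing the summed edge costs.

    cost[i][j] is the edit cost of mapping subtree i of the first
    version to subtree j of the second one.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_perm, best_total

# Invented edit costs between the subtrees of two page versions.
matching, total = min_cost_bijection([[4, 1, 3],
                                      [2, 0, 5],
                                      [3, 2, 2]])
# matching == (1, 0, 2): subtree 0 maps to 1, 1 to 0, 2 to 2.
```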

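The w-shingling of [8] reduces to hashing every w-gram of tokens and comparing the resulting sets, for instance with the Jaccard coefficient mentioned above (w = 3 and MD5 are arbitrary choices here):

```python
import hashlib

def shingles(text: str, w: int = 3) -> set:
    """Hash every contiguous run of w tokens (a w-shingle)."""
    tokens = text.split()
    return {
        hashlib.md5(" ".join(tokens[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(max(len(tokens) - w + 1, 1))
    }

def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard coefficient of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)
```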

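The signature scheme can be illustrated on a toy hierarchy. This reflects our own reading of the description above: a leaf's signature is a hash of its text, an internal node's signature is the sum of its children's signatures, and the modulus is an assumption made to keep the sums bounded:

```python
import hashlib

MOD = (1 << 61) - 1  # arbitrary large modulus to bound the sums

def signature(node) -> int:
    """node is (text, children); leaves carry the actual content."""
    text, children = node
    if not children:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        return int(digest, 16) % MOD
    # Formatting nodes: sum of the children's signatures.
    return sum(signature(child) for child in children) % MOD

old = ("", [("unchanged", []), ("old text", [])])
new = ("", [("unchanged", []), ("new text", [])])
# Only subtrees whose signatures differ need a detailed comparison;
# note that sums can collide, which is the false-negative risk above.
changed = signature(old) != signature(new)
```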
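Edit distance, the remaining family of string metrics listed in Section 5.1, is the classical dynamic program with unit costs per operation:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]
```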

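LCS-based diffing, on which the diff tools above rely, can be demonstrated with Python's standard difflib; the sample sentences are invented:

```python
import difflib

old = "The cat sat on the mat".split()
new = "The cat lay on the red mat".split()

# Opcodes are derived from the longest matching blocks of the two
# token sequences; everything that is not "equal" is a change.
ops = difflib.SequenceMatcher(None, old, new).get_opcodes()
changes = [(tag, old[i1:i2], new[j1:j2])
           for tag, i1, i2, j1, j2 in ops if tag != "equal"]
# changes: a "replace" of ['sat'] by ['lay'] and an "insert" of ['red']
```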
For instance, WebCQ [24] uses HTMLDiff to detect change at the object level. Here, the object represents in fact a part of a Web page specified for monitoring by the user, either by means of regular expressions or by marking elements of the HTML DOM tree such as a table, list, or link.
   Another example of a system that uses the GNU diff tool is WebVigiL [18].

Edit scripting on strings.
   The edit distance between two strings of characters represents the number of operations required to transform one string into the other. There are different ways of defining an edit distance, depending on which edit operations are allowed: delete, insert, etc. In string edit scripting, the atomic element is a single character and the cost is usually unitary for every edit operation defined.

Root Mean Square.
   Another notable way [38] of computing similarity is to use the RMS (Root Mean Square) value, i.e., the quadratic mean. RMS is a statistical measure of the magnitude of a varying quantity and therefore permits quantifying the change. If its value is small, then the difference between the compared elements is not significant. Often used in engineering to estimate the similarity between a canonical model and an empirical one (in order to assess the precision of the experiment), the RMS formula needs numeric values of the model parameters. In the HTML context, the ASCII values of the Web document's text characters are used in the canonical formula. Although a good idea, the RMS measure applied to the ASCII values of each character has some drawbacks: first, it does not take the hierarchies into account (it considers that every character has equal influence, independently of its position in the page), and second, it cannot take into account the semantics of the context. Variants of this measure are presented in [39].

5.2    Matching of hierarchical models

Edit scripting on trees.
   This has the same formal definition as for strings, except that the basic entities are here tree elements (nodes or subtrees). Also, edit operations can occur at different levels, depending on the types of changes considered. As a consequence, the cost models of edit operations become more complex. Each edit operation has an associated cost (proportional to the complexity of the operation, or based on heuristics of the model), so the execution of an edit script as a sequence of operations returns a cumulated cost. The similarity measure can then be computed as the inverse of this total cost. The interpretation is the following: the fewer modifications we make to the first data structure in order to make it isomorphic with its other version, the more similar the two structures are.
   The problem of determining the distance between two trees is referred to as the tree-to-tree correction problem and is covered in more depth in [5]. Some works [9] report that edit scripting with move operations is NP-hard. Usually, every model introduces some kind of modeling or computational heuristics that makes the problem (a little) less difficult. As an example, [23] computes an edit script on branches. It can also be done on subtrees, rather than on complete trees, for the same complexity reasons. Concerning the cost of change operations, every technique makes its own decisions. MH-Diff [9], for example, defines the costs as being moderated by some constants, all depending on the type of change.

Quantitative measures of change.
   Various works [17, 3, 22] use a composed measure of similarity that tries to better adapt to the specifics of the types of changes considered. The change is measured by a formula that incorporates three sub-measures of specific similarity: on the content (intersect: the percentage of words that appear in the textual content of both subtrees), on the attributes (attdist: the relative weight of the attributes that have the same value in the model elements), and on the types of elements considered in the path (typedist emphasizes differences in the tag names when going up in the hierarchy). The final measure incorporates all the above-defined types of similarity, together with some parameters that are meant to weight the importance of certain types of changes relative to others. The advantage of this measure is that it captures the symbiosis of different types of changes that occur somewhat independently: content changes in leaf nodes, attribute changes in internal nodes; the third sub-measure is more focused on the position of nodes in the structure.
   Another quantitative measure of change is proposed in [23]. Here, a weighted measure that determines the magnitude of the difference between two ordered multisets (i.e., branches) is employed. In an ordered multiset, the weight of the i-th node is defined as 2^{-i} (where i represents the depth of the element in the branch considered). Finally, the quantity of change is measured by computing the sum of the weights of the nodes that appear in the symmetric difference set.

6.    STATISTICALLY ESTIMATING CHANGE

6.1    Motivating estimative models
   We have seen the multitude of domains that are interested in the change detection issue. The general aim is to get an idea about the dynamics of a certain type of content, at a given granularity.
   In Web crawler-related applications, the interest lies more in whether a Web page has changed or not, in order to know whether a new version of the Web page should be downloaded. Only a binary response is needed; this is not because it is uninteresting to make a distinction between the different types of changes that can occur, but because current crawlers treat information at the Web page level. In this case, an estimation of the change frequency is as effective as explicitly computing it, as we show in Section 3. Indeed, if archive crawlers were more aware of the semantics of the data they process, they could clearly benefit from a broader, richer insight into the data and could develop different strategies related to storage and processing. An estimation of the change rate, although not very descriptive (we usually do not know where the change appeared or of what type it is), is still useful, especially when we can imagine a strategy that combines estimative and comparative methods of deriving dynamics. For this reason, we briefly present some of the existing statistical approaches.

6.2    Poisson model
   The study carried out in [10] reports that the changes that occur in Web pages can be modeled as a Poisson process.




A Poisson process is used to model a sequence of random events that happen independently with a fixed rate over time. Based on a(n ideally complete) change history, the frequency of change is estimated. The time-independence assumption in homogeneous models [10, 11] does not really capture the reality of the Web. The authors of [33] affirm that in the case of dynamic Web content (like blogs), the posting rates vary greatly depending on different parameters. Hence, they propose an inhomogeneous Poisson model (which does not assume that events happen independently); this model learns the posting patterns of Web pages and predicts when to recheck for new content.
   The work in [11] formalizes some use cases where, by adapting the parameters of the Poisson model to the requirements of the application, a better accuracy of the estimation can be achieved. The situation where we do not have a complete change history of a Web page is treated, which is actually the real-world case. Sometimes, we have only the last date of change, or at best we just know that a change has occurred. Contributions are also related to the fact that the authors adapt the canonical model to interesting applications and feed different estimators, thus improving the technique for the considered cases.

6.3   Kalman filters
   Another statistical approach to change detection is that of [7]. Here, the textual vector space model is employed to identify the patterns of the page and to train Kalman filters with these patterns. In the end, the change is the event that does not match the prediction. The possible disadvantages of this method are that it assumes the linearity of the system (because of the Kalman filter model, which performs exact inference in a linear dynamical system) and that it uses an incomplete vector space model (for complexity reasons).

Figure 1: Summary of the presented approaches

7.    OPEN QUESTIONS
   We end this article by discussing several issues that have not been addressed yet and which are related to the derivation of temporal properties of Web pages. The selection of subjects reflects our personal vision of the topic.

Further studies on timestamping.
   With the large number of sources of timestamping hints, it should indeed be possible to estimate the freshness of a Web page. An experimental study on the reliability of these sources, perhaps in specific contexts (a given Web server software, a given CMS, etc.), which could provide more insight into optimal strategies for timestamping Web pages, is still to be carried out.

Relevant change detection.
   An important question when detecting changes is whether the changes are relevant to the interests or needs of the user or archivist. Web users are exposed to many advertisements, and content becomes pre-fabricated, put in containers, and delivered in a targeted way, so much more attention should be paid to the relevance factor. Additionally, Web pages have some components that are more dynamic than others; it is not possible to say whether the dynamics come from irrelevant content (think of ads changing at every request) or precisely because the content is very informative (e.g., frequently updated news). To make the distinction between these, techniques that define and extract semantics (e.g., by looking at the topic and linked data) might be used as a filtering method or just for adding more significance to the change detection results. The cleaning of the Web page or its segmentation into semantic blocks is of interest in many information extraction fields and, although we mention only [40, 21, 29], there exists a large number of works that treat this subject.
   As an observation, there exists a subtle difference in the use of the term “meaningful” in the various works that we have studied. While some works [23] use it to emphasize the fact that more types of changes are detected, other approaches [30] use it as a synonym for “relevant”, from the content point of view. Vi-Diff [30] uses the VIPS algorithm [40] to obtain from a Web page a hierarchy of semantic blocks and detects changes only from this perspective, hopefully ignoring all boilerplate content. However, a deeper insight into the relevance aspect is mentioned as future work; the authors of [30] talk about using machine learning techniques for this issue, which would be an interesting line of research. Usually, heuristics [40] are employed to get a measure of relevance, because having a generic source of knowledge which can be automatically used for this task is very difficult.
   Recently, [29] used the concepts that can be extracted from the description of Web feeds to get the content of interest from Web pages. However, this kind of semantics can only be obtained for Web pages that have associated feeds.
   With the emergence of the Semantic Web, we envision new ways of distinguishing timestamps or of filtering irrelevant content from Web pages, and therefore more efficient methods for deriving the dynamics of Web pages.

8.   REFERENCES
 [1] S. Abiteboul. Issues in monitoring Web data. In Proc. DEXA, 2002.
 [2] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas. The Web changes everything: Understanding the dynamics of Web content. In Proc. WSDM, 2009.
 [3] H. Artail and K. Fawaz. A fast HTML Web change detection approach based on hashing and reducing the




number of similarity computations. Data Knowl. Eng., 66(2), 2008.
 [4] R. Baeza-Yates, C. Castillo, and F. Saint-Jean. Web dynamics, structure, and page quality. In M. Levene and A. Poulovassilis, editors, Web Dynamics. Springer, 2004.
 [5] D. T. Barnard, G. Clarke, and N. Duncan. Tree-to-tree correction for document trees. Technical Report 95-372, Queen's University, Kingston, Ontario, Canada, 1995.
 [6] D. P. Bertsekas and D. A. Castañon. Parallel asynchronous Hungarian methods for the assignment problem. INFORMS J. Computing, 5(3), 1993.
 [7] P. L. Bogen, II, J. Johnson, U. Karadkar, R. Furuta, and F. M. Shipman, III. Application of Kalman filters to identify unexpected change in blogs. In Proc. JCDL, 2008.
 [8] A. Z. Broder. On the resemblance and containment of documents. In Proc. SEQUENCES, 1997.
 [9] S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In Proc. SIGMOD, 1997.
[10] J. Cho and H. Garcia-Molina. The evolution of the Web and implications for an incremental crawler. In Proc. VLDB, 2000.
[11] J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM TOIT, 3(3), 2003.
[12] L. R. Clausen. Concerning Etags and datestamps. In Proc. IWAW, 2004.
[13] G. Cobéna and T. Abdessalem. A comparative study of XML change detection algorithms. In Service and Business Computing Solutions with XML. IGI Global, 2009.
[14] G. Cobéna, S. Abiteboul, and A. Marian. Detecting changes in XML documents. In Proc. ICDE, 2002.
[15] F. Douglis, T. Ball, Y.-F. Chen, and E. Koutsofios. The AT&T Internet difference engine: Tracking and viewing changes on the Web. World Wide Web, 1(1), 1998.
[16] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of Web pages. In Proc. WWW, 2003.
[17] S. Flesca and E. Masciari. Efficient and effective Web page change detection. Data Knowl. Eng., 46(2), 2003.
[18] J. Jacob, A. Sanka, N. Pandrangi, and S. Chakravarthy. WebVigiL: An approach to just-in-time information propagation in large network-centric environments. In M. Levene and A. Poulovassilis, editors, Web Dynamics. Springer, 2004.
[19] A. Jatowt, Y. Kawai, and K. Tanaka. Detecting age of page content. In Proc. WIDM, 2007.
[20] H. P. Khandagale and P. P. Halkarnikar. A novel approach for web page change detection system. Intl. J. Comput. Theory Eng., 2(3), 2010.
[21] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proc. WSDM, 2010.
[22] I. Khoury, R. M. El-Mawas, O. El-Rawas, E. F. Mounayar, and H. Artail. An efficient Web page change detection system based on an optimized Hungarian algorithm. IEEE TKDE, 19(5), 2007.
[23] S.-J. Lim and Y.-K. Ng. An automated change-detection algorithm for HTML documents based on semantic hierarchies. In Proc. ICDE, 2001.
[24] L. Liu, C. Pu, and W. Tang. WebCQ: Detecting and delivering information changes on the Web. In Proc. CIKM, 2000.
[25] J. Masanès, editor. Web Archiving. Springer-Verlag, 2006.
[26] Netcraft. January 2011 Web survey. http://news.netcraft.com/archives/2011/01/12/january-2011-web-server-survey-4.html, 2011.
[27] A. Ntoulas, J. Cho, and C. Olston. What's new on the Web? The evolution of the Web from a search engine perspective. In Proc. WWW, 2004.
[28] S. Nunes, C. Ribeiro, and G. David. Using neighbors to date Web documents. In Proc. WIDM, 2007.
[29] M. Oita and P. Senellart. Archiving data objects using Web feeds. In Proc. IWAW, 2010.
[30] Z. Pehlivan, M. Ben Saad, and S. Gançarski. A novel Web archiving approach based on visual pages analysis. In Proc. IWAW, 2009.
[31] D. Rocco, D. Buttler, and L. Liu. Page digest for large-scale Web services. In Proc. CEC, 2003.
[32] S. C. Sahinalp and A. Utis. Hardness of string similarity search and other indexing problems. In Proc. ICALP, 2004.
[33] K. C. Sia, J. Cho, and H.-K. Cho. Efficient monitoring algorithm for fast news alerts. IEEE TKDE, 19(7), 2007.
[34] K. Sigurðsson. Incremental crawling with Heritrix. In Proc. IWAW, 2005.
[35] sitemaps.org. Sitemaps XML format. http://www.sitemaps.org/protocol.php, 2008.
[36] W3C. HTML 4.01 specification. http://www.w3.org/TR/REC-html40/, 1999.
[37] Y. Wang, D. J. DeWitt, and J.-Y. Cai. X-Diff: An effective change detection algorithm for XML documents. In Proc. ICDE, 2003.
[38] D. Yadav, A. K. Sharma, and J. P. Gupta. Change detection in Web pages. In Proc. ICIT, 2007.
[39] D. Yadav, A. K. Sharma, and J. P. Gupta. Parallel crawler architecture and web page change detection. WSEAS Transactions on Computers, 7(7), 2008.
[40] S. Yu, D. Cai, J.-R. Wen, and W.-Y. Ma. Improving pseudo-relevance feedback in Web information retrieval using Web page segmentation. In Proc. WWW, 2003.


