=Paper=
{{Paper
|id=None
|storemode=property
|title=Deriving Dynamics of Web Pages: A Survey
|pdfUrl=https://ceur-ws.org/Vol-707/TWAW2011-paper4.pdf
|volume=Vol-707
|dblpUrl=https://dblp.org/rec/conf/www/OitaS11
}}
==Deriving Dynamics of Web Pages: A Survey==
Deriving Dynamics of Web Pages: A Survey*

Marilena Oita
INRIA Saclay – Île-de-France & Télécom ParisTech; CNRS LTCI
Paris, France
marilena.oita@telecom-paristech.fr

Pierre Senellart
Institut Télécom; Télécom ParisTech; CNRS LTCI
Paris, France
pierre.senellart@telecom-paristech.fr
ABSTRACT

The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up-to-date with the current version of a Web page, all the more so since not all apparent changes are significant ones. We review major approaches to change detection in Web pages and extraction of temporal properties (especially, timestamps) of Web pages. We focus our attention on techniques and systems that have been proposed in the last ten years and we analyze them to get some insight into the practical solutions and best practices available. We aim at providing an analytical view of the range of methods that can be used, distinguishing them on several dimensions, especially, their static or dynamic nature, the modeling of Web pages, or, for dynamic methods relying on comparison of successive versions of a page, the similarity metrics used. We advocate for more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services

General Terms

Algorithms, Experimentation, Measurement

Keywords

Change monitoring, Web archiving, Timestamping

Copyright 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. TWAW 2011, March 28, 2011, Hyderabad, India.

*This research was funded by the European Research Council grant Webdam FP7-ICT-226513.

1. INTRODUCTION

The World Wide Web challenges our capacity to develop tools that can keep track of the huge amount of information that is modified at a rapid rate. Web archiving crawlers [25], especially, need the ability to detect change in Web content and to infer when a Web page last changed. This ability is fundamental in maintaining the coherence of the crawl, in adjusting its refresh rate, in versioning, and in allowing a user to retrieve meaningful temporal data. The understanding of the dynamics of Web pages, that is, how fast Web content changes and what the nature of these changes is, its implications on the structure of the Web, and the correlation with the topic of pages, are also popular subjects in the research literature [4].

In addition to being of paramount importance to Web archiving, the subject of change detection is of interest in various applications and domains, such as large-scale information monitoring and delivery systems [18, 15, 24, 17, 23] or services (e.g., http://timelyweb.en.softonic.com/, http://www.urlywarning.net/, http://www.changealarm.com/, http://www.changedetect.com/), Web cache improvement [11], version configuration and management of Web archives [30], active databases [18], and the servicing of continuous queries [1]. Research has focused on finding novel techniques for comparing snapshots of Web pages (a reference Web page and its updated version) in order to detect change and estimate its frequency. Change can however be detected at various levels: there are various aspects of dynamics which must be considered when studying how Web content changes and evolves.

The majority of works have seen the change detection problem from a document-centric perspective, as opposed to an object-centric one. By object or entity we mean here any Web content, part of a Web page, that represents meaningful information per se: an image, news article, blog post, comment, etc. Comparatively little effort has been put into distinguishing relevant changes from those that might occur because of the dynamic template of a Web page (ads, active layout, etc.), i.e., its boilerplate [21].

We study in this article some of the strategies that have been established in different settings, with the aim of providing an overview of the existing techniques used to derive temporal properties of Web pages.

There is a large body of work on the related problem of change detection in XML documents, particularly for purposes of data integration and update management in XML-centric databases. However, the solutions developed for XML documents cannot be applied without serious revisions to HTML pages. The model assumptions made for XML do not really hold for HTML.
Indeed, the HTML and XML formats have a key difference: while an XML document defines the nature of its content by its meta tags, HTML tags define mainly presentational aspects of the content within. In addition to the challenges that exist in XML documents, Web pages add some others by their lack of formal semantics, a fuzziness regarding the formatting, the embedding of multimedia and scripts, etc. Separate approaches commonly need to be adopted, and the research done on XML documents is beyond the scope of our paper, although we mention some works on XML [37, 14] that have particular relevance to Web page change detection. We refer the reader to [13] for a survey of XML change detection algorithms.

We focus in this survey on deriving the dynamics of HTML documents.

Change detection mechanisms can either be static, estimating the date of last modification of content from the Web page itself (its code, semantics or neighbors), or dynamic, comparing successive versions of a Web page. The structure of this article reflects these dimensions. In Section 2, we present static approaches to timestamping Web pages, while Section 3 introduces dynamic methods. We then analyze in Section 4 the different models of a Web page used by existing techniques for comparing successive versions. Similarity metrics used in dynamic methods are independently investigated in Section 5. We briefly describe statistical modeling approaches to estimate change frequency in Section 6. We conclude with a discussion of some remaining open questions.

2. STATIC APPROACHES: TIMESTAMPING

This section deals with methods for inferring temporal properties of a Web page in a static manner, as opposed to the commonly used dynamic computation of the difference between successive versions of a given Web page. The goal here is to infer the creation or last modification date of a Web page or, possibly, of some parts of it. We study sources of data that can be useful for that purpose: metadata, the content of the Web page itself, or its graph neighborhood.

The canonical way of timestamping a Web page is to use the Last-Modified HTTP header. Unfortunately, studies have shown this approach is not reliable in general [12]. We describe next why this happens in practice, as well as other techniques for timestamping Web pages.

2.1 HTTP metadata

HTTP/1.1, the main protocol used by Web clients and servers to exchange information, offers several features of interest for timestamping, the foremost of which are the ETag and Last-Modified HTTP response headers. Entity tags (or ETags) are unique identifiers for a given version of a particular document. They are supposed to change if and only if the document itself changes. Servers can return an ETag with any response, and clients can use the If-Match and If-None-Match HTTP request headers to condition the retrieval of the document on a change in the ETag value, thus avoiding retrieving already known content. If-Modified-Since and If-Unmodified-Since provide conditional downloading features in a similar way, based on timestamps. Even when conditional downloading is not possible, ETags and HTTP timestamps can be retrieved by a Web client without downloading a whole Web page by making use of the HEAD HTTP method. The problem is that while this information is generally provided and is very reliable for static content (e.g., static HTML pages or PDF), it is most of the time missing or changed at every request (the timestamp given is that of the request, not of the content change) when the content is dynamic (generated by content management systems, etc.). Some CMSs do return correct HTTP timestamps, such as MediaWiki (http://www.mediawiki.org/), but they seem to be a minority.

In [12], Clausen presents an experimental study of the reliability of ETags and HTTP timestamps on a collection of a few million Danish Web pages. He finds out that the best strategy for avoiding useless downloads of versions of Web pages already available in a Web archive is to always download when the ETag header is missing, and otherwise to download only if the Last-Modified header indicates change. This rather counterintuitive strategy yielded in this experimental study an almost perfect prediction of change, and a 63% accuracy in predicting non-change. Given that the majority of Web servers run some version of the open-source Apache HTTP server (http://www.apache.org/) [26], it would be interesting to see whether this strategy is correlated with some inherent behavior of this software. Furthermore, repeating this experiment on a larger scale and with a more recent set of Web pages would be of interest.

HTTP also provides the Cache-Control and Expires response headers. This information is often given, but with a zero or very low expiration delay, which means that nothing interesting can be derived from it. In some specific and controlled environments (e.g., Intranets), it might still be useful to look at these two pieces of information to estimate the refresh rate of a Web page.

2.2 Timestamps in Web content

Content management systems, as well as Web authors, often provide in the content of a Web page some human-readable information about its last date of modification. This can be a global timestamp (for instance, preceded by a "Last modified" string, in the footer of a Web page) or a set of timestamps for individual items in the page, such as news stories, blog posts, comments, etc. In the latter case, the global timestamp might be computed as the maximum of the set of individual timestamps. It is actually quite easy to extract and recognize such information, with keyword selection (last, modification, date, etc.) or with entity recognizers for dates (built out of simple regular expressions). However, this timestamp is often quite informal and partial: there is sometimes no time indication, and most of the time no timezone. To the best of our knowledge, no formal study of the precision reached by extracting timestamps from Web content has been carried out.

2.3 Semantic temporal associations

In addition to these timestamps provided to humans, documents on the Web may include additional semantic timestamps meant for machines. No mechanism for this exists in HTML per se, but the HTML specification [36] allows for arbitrary metadata in the form of meta tags, one particular profile of such metadata being Dublin Core (http://dublincore.org/documents/dcmi-terms/), whose modified term indicates the date of last modification of a Web page. Both content management systems and Web authors occasionally use this possibility. Web feeds (in RSS or Atom formats) also have semantic timestamps, which are quite reliable, since they are essential to the working of applications that exploit them, such as feed readers.
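The re-download strategy reported from Clausen's study [12] can be sketched as a small decision function. The HTTP header names are the standard ones; the archived-metadata dictionary is a hypothetical structure used only for illustration.

```python
def should_download(stored, response_headers):
    """Decide whether to re-fetch a page already in the archive, following
    the strategy reported in [12]: always download when the server sends
    no ETag; otherwise download only when Last-Modified differs from the
    value recorded at the previous crawl."""
    if response_headers.get("ETag") is None:
        return True  # no ETag: the server's timestamp info is untrustworthy
    last_modified = response_headers.get("Last-Modified")
    return last_modified != stored.get("last_modified")

# Usage: compare archived metadata against headers from a fresh HEAD request.
archived = {"last_modified": "Tue, 15 Mar 2011 08:00:00 GMT"}
print(should_download(archived, {}))  # no ETag -> True
print(should_download(archived, {
    "ETag": '"abc123"',
    "Last-Modified": "Tue, 15 Mar 2011 08:00:00 GMT"}))  # unchanged -> False
```

The second call avoids a useless download: the ETag is present and the Last-Modified header matches the archived value.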
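The keyword-and-regular-expression extraction described in Section 2.2 can be sketched as follows; the keyword list and the date patterns are illustrative assumptions, not those of any particular surveyed system.

```python
import re

# Keywords signaling a human-readable timestamp, and two simple date shapes.
KEYWORDS = re.compile(r"(last\s+modified|updated|posted|date)", re.I)
DATE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")

def extract_timestamps(text):
    """Return candidate date strings found on lines that contain a
    timestamp keyword (keyword selection + date recognizer)."""
    found = []
    for line in text.splitlines():
        if KEYWORDS.search(line):
            found.extend(DATE.findall(line))
    return found

page_footer = "Contact us\nLast modified: 2011-03-28\nCopyright"
print(extract_timestamps(page_footer))  # ['2011-03-28']
```

As the survey notes, such timestamps are informal and partial: this sketch recovers a date string but, typically, no time of day and no timezone.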
In some cases, external semantic content can be used for dating an HTML Web page: for instance, an RSS feed containing blog entries can be mapped to the corresponding Web page, in order to date individual items [29]. Another case is that of sitemaps [35]. Sitemaps are files that can be provided by the owner of a Web site to describe its organization, so as to improve its indexing by Web search engines. Sitemaps allow for both timestamps and change rate indications (hourly, monthly, etc.), but these features are not often used. Very few content management systems produce all of this, although it would be the ideal case: the download of a single file would suffice to get all timestamping information about the whole Web site.

2.4 Using the neighborhood

It is possible to use the graph structure of the Web to help timestamp Web pages: [28] uses the neighboring pages of a Web page to estimate its timestamp. When no source of reliable timestamps is found for a given page using one of the techniques described above, its timestamp is set to some form of average of the timestamps of the pages pointed to and pointed by this page. The inherent assumption is that pages linked together tend to have similar update patterns. The precision is not very high, but better than nothing when no other information is available.

3. DYNAMIC METHODS

When static techniques do not give adequate results, there is still the possibility of comparing a Web page with its previous version in order to determine (sometimes in a rough way) an equivalent of Last-Modified. The timestamp that gives the last modification of a Web page can then be inferred as the date when change has been detected.

Nevertheless, for a precise estimation, a just-in-time crawl of versions is needed. In reality, the frequency of change is quite difficult to estimate because Web pages have different patterns of change. In general, many factors determine variations in the frequency of change for a given Web page: the CMS, the subject, the time of the year, even the time of the day, etc.

Estimating Web pages' frequency of change is the subject of many studies [16, 2, 27, 10, 19]. Their results are however heavily dependent on the change detection technique they have used. There are two parameters that influence the process of determining the dynamics:

1. the frequency of change, which is not known in advance: if we do not crawl the Web pages on time, we miss versions and the timestamp detected will then be imprecise;

2. the change detection technique, which heavily relies on the similarity metrics and on the model (that can capture more or fewer types of changes), and on the filtering of dynamic elements that influence the frequency without being truly relevant (and which occur quite often in Web pages because of AJAX applications or advertisements).

A method of filtering irrelevant content is to know what is important rather than trying to filter what is unimportant.

If we knew in advance the frequency of crawl, then we would know when change occurs and could therefore set timestamps for new versions. This is unfortunately not the case, but we need however to crawl frequently enough (better too frequently than not frequently enough). Once we have these versions, we can detect whether any change happened in the interval of time between two crawls. Based on this, timestamps can be derived with good approximation.

The majority of works consider that they have access to the versions, and they focus on detecting change efficiently. Few make a semantic distinction between changes by disregarding insignificant ones. Next, we present some general notions about changes: what kinds of changes can occur in Web pages, which have been identified in the approaches we studied and which have not, and finally how change is actually represented.

3.1 Types of changes

We summarize here the types of changes detected by different approaches. There are however other, more sophisticated types that are sometimes mentioned, but not treated. For instance, behavioral changes are mentioned in [38]; they occur in active HTML elements like scripts, embedded applications and multimedia. These new forms of Web content have a big impact on the Web today, but they require a more complex modeling.

All considered approaches detect changes in content. Works that model the HTML page as a tree, including page digest encoding, usually also detect structural and attribute changes. However, differences exist in the coverage of cases for these types of changes. For example, MH-Diff [9] also detects move and copy structural changes, which is an improvement over the traditional detection of insert, delete and update. Structural (or layout) changes occur when the position of elements in the page is modified. Attribute (or presentation) changes are related to the representation of information, for instance changes in the font, colors or captions. For capturing structural and attribute changes, the model has to be aware of their existence; this implies a more complex model, which generally has a negative impact on performance. Unlike with flat-file models of Web pages, the output of content change detection in hierarchical models is more meaningful: the type of node in which the content change occurred can also be identified.

There are also type changes, that is, modifications which come about when the HTML tags change: e.g., a p tag which becomes a div. Type changes can be detected by [31], which uses the page digest encoding that provides a mechanism for locating nodes of a particular type (see further).

Semantic types of change capture the meaning of content that has changed. They are defined in SCD [23], a pioneering work in this direction.

Changes are sometimes captured in a quantitative manner rather than in a qualitative one. In contrast with the qualitative way, where the change is described (in a delta file) or visualized in a comparative manner, quantitative approaches estimate the amount of change of a specific type. More specifically, all approaches that use the similarity formula defined in CMW [17] do not reconstruct the complete sequence of changes, but give a numerical value for it. In this case, supposing a threshold of change, we can determine whether a page has changed or not, which actually represents more of an oracle response, still useful in the majority of applications.
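A change oracle over two crawled versions can take the two flavors discussed above: an exact comparison, where any change triggers it, or a threshold over a quantitative similarity score. Both sketches below are illustrative; the similarity function is left abstract, standing in for any CMW-style measure in [0, 1].

```python
import hashlib

def versions_differ(old_html, new_html):
    """Cheapest possible change oracle between two crawled versions:
    compare hashes of the raw content. Detects any change at all, with
    no distinction between significant and boilerplate changes."""
    digest = lambda s: hashlib.sha1(s.encode("utf-8")).hexdigest()
    return digest(old_html) != digest(new_html)

def changed(old_html, new_html, similarity, threshold=0.9):
    """Threshold-based oracle over a quantitative similarity score:
    `similarity` is any function returning a value in [0, 1]."""
    return similarity(old_html, new_html) < threshold

print(versions_differ("<p>a</p>", "<p>b</p>"))  # True
```

As the survey points out, the exact oracle is very sensitive to irrelevant template changes; the thresholded variant is the one that tolerates them, at the price of choosing a threshold.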
3.2 The representation of change

There are various ways to present the difference between two documents. Changes are usually stored in a physical structure generically called a delta file or delta tree. The format of storing the change for RMS [38] consists in a specialized set of arrays that capture the relationships among the nodes and the changes that occur both in structure and content.

Systems for monitoring change like [24, 18] typically have a user interface and present changes in a graphical way. HTMLDiff merges the input Web page versions into one document that summarizes all the common parts and also the changed ones. The advantage is that the common parts are displayed just once, but on the other hand, the resulting merged HTML can be syntactically or semantically incorrect. Another choice linked to change presentation is to display only the differences and omit the common parts of the two Web pages. When the documents have a lot of data in common, presenting only the differences can be better, with the drawback that the context is missing. The last approach is to present the differences between the old and new version side by side.

These presentation modes are used in combination, rather than being the unique choice for a given system. For example, [24, 18] present the results of the change monitoring service using a hybrid approach: presentation modes are combined depending on the type of change that is presented.

4. HTML DOCUMENT MODELS

This section contains an overview of the models that are considered in the quest of detecting changes in Web documents. The modeling step is a key one, as it determines the elements on which the comparison algorithms operate. We first discuss the "naïve" approach, that is, to consider Web pages as flat files (strings); then we describe tree models, a popular choice in the literature. We also explore some approaches that are based on tree models, but which essentially transform the two versions to be compared into a bipartite graph on which specific algorithms are applied. Finally, we present the Page Digest design of Web pages, a manner of encoding that clearly separates structural elements of Web documents from their content, while remaining highly compact.

4.1 Flat files

Some early change detection systems model Web pages as flat files [15]. As these models take into account neither the hierarchical structure of HTML documents nor the characteristics of the layout, they can detect only content changes, and this without making any semantic distinction in the content.

Some works [34] try to filter out irrelevant content first, by using heuristics on the type of content and regular expressions. After this basic filtering, the Web page content is directly hashed and compared between versions. Unfortunately, we can never filter out all kinds of irrelevant content, especially when its manner of encoding or its type gets more complex.

4.2 Trees

A natural approach is to represent a Web page as a tree using the DOM model. By taking into account the hierarchies, structural and attribute changes can be detected besides content changes.

Differences between tree models appear in their ordered characteristics and in the level of granularity at which change is detected: node, branch, subtree or "object". We discuss these aspects below.

First, the modeling of a Web page into a tree requires a preprocessing step of cleaning. This is a significant one, because it corrects missing, mismatching or out-of-order end tags, as well as all other syntactic ill-formedness of the HTML document. A tree is constructed first by filtering the "tag soup" HTML into an XML document and, second, by manipulating the result in order to obtain a workable tree structure using implementations of the DOM standard (that usually employ XSLT and XPath). Sometimes an initial pruning of elements is also done: [22] filters out scripts, applets, embedded objects and comments.

Many works do not specify how they realize the cleaning, so either they assume it to be done in advance, or they solve this issue by using an HTML cleaning tool. HTML Tidy (http://tidy.sourceforge.net/) is well-suited for this purpose and mentioned in [3].

There are however cases [23] when the tree model does not need to be cleaned: as the semantics of tags (value, name) is leveraged in the technique, the structure does not have to be enforced.

Ordered trees.

The ordered characteristic of trees implies that the order of appearance of nodes is considered in the algorithm and therefore included in the model.

The RMS algorithm [38] stores the level (depth) of a node, information which is used in the tree traversal and parsing. The SCD [23] algorithm uses branches of ordered trees to detect semantic changes. The notion of branch is used to give the context of a node and is formalized as an ordered multiset, in which the elements are designated by the tag name of an HTML non-leaf node, or its content (text) if it represents a leaf node. The order is very important in this model because of the data hierarchy considered (e.g., book.author.name.Eminescu vs. a markup hierarchy like div.p.b.PCDATA). In this model, if we change the order, we change the semantics of the hierarchies, or this semantics becomes incoherent.

The ordered tree model is also used in [20].

Unordered trees.

The unordered labeled tree model does not consider the order of appearance of elements in the tree as relevant; instead, only the parent-child relationships are captured.

It is mentioned in MH-Diff [9] that the change detection problem for unordered trees is harder than for ordered ones. Like [17, 9], [22] constructs a weighted bipartite graph from the two trees given as input. In these models, the order has no bearing on the final structure on which further processing will be done, therefore this feature is not captured.

[3] delimits and encodes subtrees; these are generated by analyzing the level of a node in the unordered tree. Starting from the root, whenever a level that is a multiple of 3 is reached, a subtree is created; the node at the next level becomes the local root of the following subtree, and so on iteratively until the end of the hierarchy. Each subtree is marked with the tag name of its local root and indexed based on its start and end node identifiers. A hashtable finally maps metadata about nodes (like tag name, content, path to the root, attribute-value pairs, etc.).
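As an illustration of the cleaning-and-tree-building step, the sketch below builds a simple nested tree from "tag soup" HTML with Python's standard-library parser, tolerating missing end tags and pruning scripts and embedded objects in the spirit of [22]. It is a minimal stand-in for the Tidy/DOM pipelines mentioned above, not the implementation of any surveyed system.

```python
from html.parser import HTMLParser

PRUNED = {"script", "applet", "object"}  # elements filtered out, as in [22]

class TreeBuilder(HTMLParser):
    """Builds a simple (tag, attrs, children) tree from tag-soup HTML."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", {}, [])
        self.stack = [self.root]
        self.skip = 0  # depth inside a pruned element

    def handle_starttag(self, tag, attrs):
        if tag in PRUNED:
            self.skip += 1
            return
        if self.skip:
            return
        node = (tag, dict(attrs), [])
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in PRUNED:
            self.skip = max(0, self.skip - 1)
            return
        # tolerate missing or mismatched end tags: pop only if tag is open
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.stack[-1][2].append(("#text", {}, [data.strip()]))

builder = TreeBuilder()
builder.feed("<div><p>Hello <b>world</div>")  # note the missing end tags
print(builder.root)
```

The mismatched end tags are absorbed rather than rejected, which is the essential property the cleaning step must provide before any tree comparison can run.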
4.3 Bipartite graph model

The bipartite graph model is derived from unordered trees: it consists in two independent sets (acquired after tree pruning), each corresponding to a version, that are connected by cost edges. As set elements, subtrees are chosen over nodes in [22, 17]. The assumption [17] is that we might be more interested in which subtree the change occurs than in which specific node.

The cost of an edge represents the cost of the edit script needed to make a model entity (node or subtree) of the first set isomorphic with one of the second set. The similarities between all subtrees in the first tree and all those of the second tree are computed and placed in a cost matrix. Having this matrix, the Hungarian algorithm [6] is used to find in polynomial time a minimum-cost bijection between the two partitions. This algorithm is typically used in linear programming to find the optimal solution to the assignment problem.

CMW [17], as well as the MH-Diff [9] algorithm, are based on transforming the change detection problem into one of computing a minimum-cost edge cover of a bipartite graph. Optimizations are brought out in [22] in comparison with the work in [6], but the general aim and similarity metrics remain the same as in [3].

4.4 Page Digest

The Page Digest model (http://www.cc.gatech.edu/projects/disl/PageDigest/) has been adopted in [31] and [38], and represents a more compact encoding of data than the HTML and XML formats, while preserving all their advantages. To construct this model, some steps are performed: counting the nodes, enumerating children in a depth-first manner (in order to capture the structure of the document), and content encoding and mapping for each node, an encoding which preserves the natural order of text in the Web page.

SDiff [31] is a Web change monitoring application that uses a digest format that also includes tag type and attribute information. RMS [38] also uses this model, although without giving many details. The advantages of the Page Digest over the DOM tree model are enumerated in [31]. We mention minimality and execution performance: the reduction of tag redundancy gives a more compact model, and therefore a faster document traversal. The algorithms developed on this model run in linear time without too many heuristics or restrictions, while also capturing a large palette of changes.

5. SIMILARITY METRICS

Similarity metrics are used in dynamic methods of detecting change, in the matching stage of the two versions modeled as described in Section 4.

For string models, (content) change is identified when the data strings are discovered to be partially dissimilar. For tree models that have various dimensions, the problem gets more complex.

The aim is to identify model elements that are essentially the same, but which have been affected by change. Model elements that are identical are pruned because they have basically not changed between versions; also, totally dissimilar model elements do not represent instances of the same object that has evolved, so they are not included in further processing steps. Essentially, only approximately similar elements are studied further. A typical matching is based on comparisons of attribute values of the model elements. If they have the same properties, then they are similar.

We describe next the types of similarity metrics used for change detection in different settings or applications, for the studied cases.

5.1 String matching techniques

Approaches that compare two Web pages modeled as flat files (i.e., strings) rely on hash-based methods, edit distance metrics or techniques based on the longest common subsequence. We do not exhaustively cover all the possibilities, but rather present some of those that we have encountered in the analyzed works.

Naïve: Jaccard.

One simple technique for change detection in text is to count the number of words that are common (regardless of their position) to two string sequences and to divide the result by the number of distinct words. Another possibility is to divide by the length of the first string, as done in [30].

Hash-based methods.

In this category, we mention [8, 20, 3]. The method of [8] uses shingling, which is a flexible method of detecting changes between two strings. It is usually referred to as 𝑤-shingling, where 𝑤 denotes the number of tokens in each shingle in the set. A shingle is a contiguous subsequence (𝑤-gram) of the reference string. For the two strings to be compared, their shingles are hashed; if the strings have a lot of shingle values in common, then they are similar.

Another method is to use signatures. A signature of a string is a function of its hashed value. When the model is hierarchical, the signature of a node is computed based on its terminal path. For a formatting (or non-leaf) node, the signature represents the sum of the signatures of its children, down to the leaf nodes, where changes are actually detected. In [20], only nodes that have signatures different from those in the original version are compared. Change detection algorithms that employ signatures have the disadvantage that false negatives are possible: a change exists, but the signature does not differ. This obviously depends on the careful choice of the hashing space and, ultimately, on whether the application can tolerate false results.

Another hash-based approach is presented in [3], where the change is detected at the atomic element level using a direct hash.
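The matching step of Section 4.3 reduces to the assignment problem over a cost matrix. The sketch below solves a toy instance by brute force for clarity; the Hungarian algorithm [6] computes the same minimum-cost bijection in polynomial time. The cost values are invented for illustration.

```python
from itertools import permutations

def min_cost_matching(cost):
    """Minimum-cost bijection between two equal-size partitions, given a
    square cost matrix. Exhaustive search over permutations, for clarity
    only; the Hungarian algorithm gives the same result efficiently."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)]

# Hypothetical edit costs between subtrees of version 1 (rows) and
# version 2 (columns): low-cost pairs are likely the "same" subtree.
cost = [[0, 9, 7],
        [8, 1, 6],
        [5, 9, 2]]
print(min_cost_matching(cost))  # [(0, 0), (1, 1), (2, 2)]
```

Each matched pair with nonzero cost corresponds to a subtree that is approximately similar, i.e., one that has changed between the two versions.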
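The naïve word-overlap measure and 𝑤-shingling described in Section 5.1 can be sketched as follows; whitespace tokenization and the choice of hash function are illustrative assumptions.

```python
import hashlib

def word_jaccard(a, b):
    """Naive measure: words in common (positions ignored) divided by the
    number of distinct words across both strings."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def shingles(text, w=3):
    """Set of hashed w-grams of tokens (w-shingling)."""
    tokens = text.split()
    grams = [" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]
    return {hashlib.md5(g.encode()).hexdigest() for g in grams}

def shingle_similarity(a, b, w=3):
    """Resemblance of two strings as the overlap of their shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

print(shingle_similarity("the quick brown fox jumps",
                         "the quick brown fox sleeps"))  # 0.5
```

Unlike the word-level measure, shingling is sensitive to word order, since each shingle is a contiguous subsequence of the reference string.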
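The signature scheme described in Section 5.1 (hash the leaves, sum signatures upward) can be sketched on a toy tree; the (tag, payload) tuple representation is an assumption made for illustration. Note that summing makes a non-leaf signature insensitive to child order, one concrete way the false negatives mentioned above can arise.

```python
import hashlib

def signature(node):
    """Signature of a tree node: a hash of the text for leaf nodes, the
    sum of the children's signatures for formatting (non-leaf) nodes."""
    tag, payload = node
    if isinstance(payload, str):  # leaf: hash the content
        return int(hashlib.sha1(payload.encode()).hexdigest(), 16)
    return sum(signature(child) for child in payload)

old = ("div", [("p", "Breaking news"), ("p", "Weather: sunny")])
new = ("div", [("p", "Breaking news"), ("p", "Weather: rain")])
print(signature(old) != signature(new))  # True: the changed leaf propagates up
```

A comparison algorithm in the style of [20] would descend only into children whose signatures differ, skipping the unchanged "Breaking news" subtree entirely.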
WebCQ detects change at object level: the object is in fact a part of a Web page specified for monitoring by the user, either by means of regular expressions or by marking elements of the HTML DOM tree, such as a table, a list, or a link. Another example of a system that uses the GNU diff tool is WebVigiL [18].

Edit scripting on strings.
The edit distance between two strings of characters represents the number of operations required to transform one string into the other. There are different ways of defining an edit distance, depending on which edit operations are allowed: delete, insert, etc. In string edit scripting, the atomic element is a single character and the cost of every edit operation defined is usually unitary.

Root Mean Square.
Another notable way [38] of computing similarity is to use the RMS (Root Mean Square) value, i.e., the quadratic mean. RMS is a statistical measure of the magnitude of a varying quantity and therefore permits quantifying change: if its value is small, the difference between the compared elements is not significant. Often used in engineering to estimate the similarity between a canonical model and an empirical one (in order to assess the precision of an experiment), the RMS formula needs numeric values for the model parameters. In the HTML context, the ASCII values of the characters of the document's text are used in the canonical formula. Although a good idea, the RMS measure applied to the ASCII values of individual characters has some drawbacks: first, it does not take the hierarchy into account (it considers that every character has equal influence, independently of its position in the page); second, it cannot take into account the semantics of the context. Variants of this measure are presented in [39].

5.2 Matching of hierarchical models

Edit scripting on trees.
Tree edit scripting has the same formal definition as for strings, except that the basic entities are here tree elements (nodes or subtrees). Also, edit operations can occur at different levels, depending on the types of changes considered. As a consequence, cost models of edit operations become more complex. Each edit operation has an associated cost (proportional to the complexity of the operation, or based on heuristics of the model), so the execution of an edit script as a sequence of operations returns a cumulated cost. The similarity measure can then be computed as the inverse of this total cost. The interpretation is the following: the fewer modifications we have to make to the first data structure in order to make it isomorphic to its new version, the more similar the two structures are.
The problem of determining the distance between two trees is referred to as the tree-to-tree correction problem and is covered in more depth in [5]. Some works [9] report that edit scripting with move operations is NP-hard. Usually, every model therefore introduces some kind of modeling or computational heuristics that make the problem (a little) less difficult. As an example, [23] computes an edit script on branches; it can also be done on subtrees, rather than on complete trees, for the same complexity reasons. Concerning the cost of change operations, every technique makes its own decisions; MH-Diff [9], for example, defines the costs as moderated by some constants, all depending on the type of change.

Quantitative measures of change.
Various works [17, 3, 22] use a composed measure of similarity that tries to better adapt to the specifics of the types of changes considered. The change is measured by a formula that incorporates three sub-measures of specific similarity: on the content (intersect, the percentage of words that appear in the textual content of both subtrees), on the attributes (attdist, the relative weight of the attributes that have the same value in the model elements), and on the types of elements considered in the path (typedist, which emphasizes differences in the tag names when going up in the hierarchy). The final measure incorporates all the above-defined types of similarity, together with some parameters that are meant to weight the importance of certain types of changes relative to others. The advantage of this measure is that it captures the combination of different types of changes, which to some extent occur independently: content changes in leaf nodes, attribute changes in internal nodes; the third sub-measure focuses on the position of nodes in the structure.
Another quantitative measure of change is proposed in [23]. Here, a weighted measure that determines the magnitude of the difference between two ordered multisets (i.e., branches) is employed. In an ordered multiset, the weight of the i-th node is defined as 1/2^i (where i represents the depth of the element in the branch considered). Finally, the quantity of change is measured by computing the sum of the weights of the nodes that appear in the symmetric difference of the two multisets.

6. STATISTICALLY ESTIMATING CHANGE

6.1 Motivating estimative models
We have seen the multitude of domains that are interested in the change detection issue. The general aim is to get an idea of the dynamics of a certain type of content, at a given granularity.
In Web crawler-related applications, the interest is rather in whether a Web page has changed or not, in order to know whether a new version of the page shall be downloaded. Only a binary response is needed; this is not because it would be uninteresting to distinguish between the different types of changes that can occur, but because current crawlers treat information at Web page level. In this case, an estimation of the change frequency is as effective as explicitly computing it, as we show in Section 3. Indeed, if archive crawlers were more aware of the semantics of the data they process, they could clearly benefit from a broader, richer insight into the data and could develop different strategies related to storage and processing. An estimation of the change rate, although not very descriptive (we usually do not know where the change appeared, or of which type it is), is still useful, especially when we can imagine a strategy that combines estimative and comparative methods of deriving dynamics. For this reason, we briefly present some of the existing statistical approaches.

6.2 Poisson model
The study carried out in [10] reports that the changes that occur in Web pages can be modeled as a Poisson process. A Poisson process is used to model a sequence of random events.
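Under the homogeneous Poisson assumption, the change rate of a page can be estimated from a history of visits. A minimal sketch (a naive maximum-likelihood estimator for equally spaced visits, not the refined estimators of [11]; the function name is ours):

```python
import math

def estimate_change_rate(changed: list[bool], interval: float) -> float:
    """Naive estimator of the Poisson change rate of a page observed
    over len(changed) equally spaced visits, where changed[k] says
    whether the page differed from the previous visit.
    Under a Poisson model with rate lam, the probability of seeing at
    least one change in an interval is 1 - exp(-lam * interval), so
    lam = -ln(1 - X/n) / interval, with X visits showing a change."""
    n, x = len(changed), sum(changed)
    if x == n:
        raise ValueError("a change seen at every visit: rate not identifiable")
    return -math.log(1 - x / n) / interval

# A page polled daily and seen changed on 4 visits out of 10:
lam = estimate_change_rate(
    [True, False, False, True, False, True, False, False, True, False], 1.0
)
# lam = -ln(0.6) ~ 0.51 changes per day; expected time between changes ~ 1/lam
```

Note that this estimator is biased when changes are frequent relative to the visit interval (several changes between two visits are observed as one), which is precisely the kind of issue the estimators of [11] address.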
Such events happen independently of each other, at a fixed rate over time. Based on a(n ideally complete) change history, the frequency of change is estimated. The time-independence assumption of homogeneous models [10, 11] does not really capture the reality of the Web. The authors of [33] observe that in the case of dynamic Web content (such as blogs), posting rates vary considerably depending on different parameters. Hence, they propose an inhomogeneous Poisson model, which does not assume that events happen at a fixed rate; this model learns the posting patterns of Web pages and predicts when to recheck for new content.
The work in [11] formalizes some use cases where, by adapting the parameters of the Poisson model to the requirements of the application, a better accuracy of the estimation can be achieved. The situation when we do not have a complete change history of a Web page, which is actually the real-world case, is also treated: sometimes we have only the last date of change, or at best we just know that a change has occurred. Contributions are also related to the fact that the authors adapt the canonical model to interesting applications and feed different estimators, thus improving the technique for the considered cases.

6.3 Kalman filters
Another statistical approach to change detection is that of [7]. Here, the textual vector space model is employed to identify the patterns of the page and to train Kalman filters with these patterns. In the end, a change is an event that does not match the prediction. The possible disadvantages of this method are that it assumes the linearity of the system (because the Kalman filter performs exact inference in a linear dynamical system) and that it uses an incomplete vector space model (for complexity reasons).

Figure 1: Summary of the presented approaches

7. OPEN QUESTIONS
We end this article by discussing several issues that have not been addressed yet and which are related to the deriving of temporal properties of Web pages. The selection of subjects reflects our personal vision on the topic.

Further studies on timestamping.
With the large number of sources of timestamping hints, it should indeed be possible to estimate the freshness of a Web page. An experimental study on the reliability of these sources, perhaps in specific contexts (a given Web server software, a given CMS, etc.), which could provide more insight into optimal strategies for timestamping Web pages, is still to be carried out.

Relevant change detection.
An important question when detecting changes is whether the changes are relevant to the interests or needs of the user or archivist. Web users are exposed to many advertisements, and content is increasingly pre-fabricated, put in containers, and delivered in a targeted way, so much more attention should be paid to the relevance factor. Additionally, Web pages have some components that are more dynamic than others; it is not possible to say whether the dynamics come from irrelevant content (think of ads changing at every request) or precisely because the content is very informative (e.g., frequently updated news). To make the distinction between these, techniques that define and extract semantics (e.g., by looking at the topic and linked data) might be used as a filtering method, or just to add more significance to the change detection results. The cleaning of a Web page or its segmentation into semantic blocks is of interest in many information extraction fields and, although we mention only [40, 21, 29], there exist many works that treat this subject.
As an observation, there is a subtle difference in the use of the term "meaningful" in the various works that we have studied. While some works [23] use it to emphasize the fact that more types of changes are detected, other approaches [30] use it as a synonym for "relevant", from the content point of view. Vi-Diff [30] uses the VIPS algorithm [40] to obtain from a Web page a hierarchy of semantic blocks and detects changes only from this perspective, hopefully ignoring all boilerplate content. However, a deeper insight into the relevance aspect is mentioned as future work; the authors of [30] consider using machine learning techniques for this issue, which would be an interesting line of research. Usually, heuristics [40] are employed to get a measure of relevance, because having a generic source of knowledge that can be automatically used for this task is very difficult.
Recently, [29] used the concepts that can be extracted from the description of Web feeds to get the content of interest from Web pages. However, this kind of semantics can only be obtained for Web pages that have associated feeds. With the emergence of the Semantic Web, we envision new ways of distinguishing timestamps or of filtering irrelevant content from Web pages, and therefore more efficient methods for deriving the dynamics of Web pages.

8. REFERENCES
[1] S. Abiteboul. Issues in monitoring Web data. In Proc. DEXA, 2002.
[2] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas. The Web changes everything: Understanding the dynamics of Web content. In Proc. WSDM, 2009.
[3] H. Artail and K. Fawaz. A fast HTML Web change detection approach based on hashing and reducing the
number of similarity computations. Data Knowl. Eng., 66(2), 2008.
[4] R. Baeza-Yates, C. Castillo, and F. Saint-Jean. Web dynamics, structure, and page quality. In M. Levene and A. Poulovassilis, editors, Web Dynamics. Springer, 2004.
[5] D. T. Barnard, G. Clarke, and N. Duncan. Tree-to-tree correction for document trees. Technical Report 95-372, Queen's University, Kingston, Ontario, Canada, 1995.
[6] D. P. Bertsekas and D. A. Castañon. Parallel asynchronous Hungarian methods for the assignment problem. INFORMS J. Computing, 5(3), 1993.
[7] P. L. Bogen, II, J. Johnson, U. Karadkar, R. Furuta, and F. M. Shipman, III. Application of Kalman filters to identify unexpected change in blogs. In Proc. JCDL, 2008.
[8] A. Z. Broder. On the resemblance and containment of documents. In Proc. SEQUENCES, 1997.
[9] S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In Proc. SIGMOD, 1997.
[10] J. Cho and H. Garcia-Molina. The evolution of the Web and implications for an incremental crawler. In Proc. VLDB, 2000.
[11] J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM TOIT, 3(3), 2003.
[12] L. R. Clausen. Concerning Etags and datestamps. In Proc. IWAW, 2004.
[13] G. Cobéna and T. Abdessalem. A comparative study of XML change detection algorithms. In Service and Business Computing Solutions with XML. IGI Global, 2009.
[14] G. Cobéna, S. Abiteboul, and A. Marian. Detecting changes in XML documents. In Proc. ICDE, 2002.
[15] F. Douglis, T. Ball, Y.-F. Chen, and E. Koutsofios. The AT&T Internet difference engine: Tracking and viewing changes on the Web. World Wide Web, 1(1), 1998.
[16] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of Web pages. In Proc. WWW, 2003.
[17] S. Flesca and E. Masciari. Efficient and effective Web page change detection. Data Knowl. Eng., 46(2), 2003.
[18] J. Jacob, A. Sanka, N. Pandrangi, and S. Chakravarthy. WebVigiL: An approach to just-in-time information propagation in large network-centric environments. In M. Levene and A. Poulovassilis, editors, Web Dynamics. Springer, 2004.
[19] A. Jatowt, Y. Kawai, and K. Tanaka. Detecting age of page content. In Proc. WIDM, 2007.
[20] H. P. Khandagale and P. P. Halkarnikar. A novel approach for Web page change detection system. Intl. J. Comput. Theory Eng., 2(3), 2010.
[21] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proc. WSDM, 2010.
[22] I. Khoury, R. M. El-Mawas, O. El-Rawas, E. F. Mounayar, and H. Artail. An efficient Web page change detection system based on an optimized Hungarian algorithm. IEEE TKDE, 19(5), 2007.
[23] S.-J. Lim and Y.-K. Ng. An automated change-detection algorithm for HTML documents based on semantic hierarchies. In Proc. ICDE, 2001.
[24] L. Liu, C. Pu, and W. Tang. WebCQ: Detecting and delivering information changes on the Web. In Proc. CIKM, 2000.
[25] J. Masanès, editor. Web Archiving. Springer-Verlag, 2006.
[26] Netcraft. January 2011 Web survey. http://news.netcraft.com/archives/2011/01/12/january-2011-web-server-survey-4.html, 2011.
[27] A. Ntoulas, J. Cho, and C. Olston. What's new on the Web? The evolution of the Web from a search engine perspective. In Proc. WWW, 2004.
[28] S. Nunes, C. Ribeiro, and G. David. Using neighbors to date Web documents. In Proc. WIDM, 2007.
[29] M. Oita and P. Senellart. Archiving data objects using Web feeds. In Proc. IWAW, 2010.
[30] Z. Pehlivan, M. Ben Saad, and S. Gançarski. A novel Web archiving approach based on visual pages analysis. In Proc. IWAW, 2009.
[31] D. Rocco, D. Buttler, and L. Liu. Page digest for large-scale Web services. In Proc. CEC, 2003.
[32] S. C. Sahinalp and A. Utis. Hardness of string similarity search and other indexing problems. In Proc. ICALP, 2004.
[33] K. C. Sia, J. Cho, and H.-K. Cho. Efficient monitoring algorithm for fast news alerts. IEEE TKDE, 19(7), 2007.
[34] K. Sigurðsson. Incremental crawling with Heritrix. In Proc. IWAW, 2005.
[35] sitemaps.org. Sitemaps XML format. http://www.sitemaps.org/protocol.php, 2008.
[36] W3C. HTML 4.01 specification. http://www.w3.org/TR/REC-html40/, 1999.
[37] Y. Wang, D. J. DeWitt, and J.-Y. Cai. X-Diff: An effective change detection algorithm for XML documents. In Proc. ICDE, 2003.
[38] D. Yadav, A. K. Sharma, and J. P. Gupta. Change detection in Web pages. In Proc. ICIT, 2007.
[39] D. Yadav, A. K. Sharma, and J. P. Gupta. Parallel crawler architecture and Web page change detection. WSEAS Transactions on Computers, 7(7), 2008.
[40] S. Yu, D. Cai, J.-R. Wen, and W.-Y. Ma. Improving pseudo-relevance feedback in Web information retrieval using Web page segmentation. In Proc. WWW, 2003.