<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marilena Oita</string-name>
          <email>marilena.oita@telecom-paristech.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Senellart</string-name>
          <email>pierre.senellart@telecom-paristech.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INRIA Saclay - Île-de-France, &amp; Télécom ParisTech; CNRS LTCI</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut Télécom, Télécom ParisTech; CNRS LTCI</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>25</fpage>
      <lpage>32</lpage>
      <abstract>
        <p>The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up-to-date with the current version of a Web page, all the more so since not all apparent changes are significant ones. We review major approaches to change detection in Web pages and extraction of temporal properties (especially, timestamps) of Web pages. We focus our attention on techniques and systems that have been proposed in the last ten years and we analyze them to get some insight into the practical solutions and best practices available. We aim at providing an analytical view of the range of methods that can be used, distinguishing them on several dimensions, especially, their static or dynamic nature, the modeling of Web pages, or, for dynamic methods relying on comparison of successive versions of a page, the similarity metrics used. We advocate for more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.</p>
      </abstract>
      <kwd-group>
        <kwd>Change monitoring</kwd>
        <kwd>Web archiving</kwd>
        <kwd>Timestamping</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3.5 [Information Storage and Retrieval]: Online
Information Services— Web-based services</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>The World Wide Web challenges our capacity to develop
tools that can keep track of the huge amount of
information that is modified at a rapid rate.</p>
      <p>* This research was funded by the European Research Council
grant Webdam FP7-ICT-226513.</p>
      <p>Copyright 2011 for the individual papers by the papers’ authors. Copying
permitted only for private and academic purposes. This volume is published
and copyrighted by its editors.</p>
      <p>
        TWAW 2011 March 28, 2011, Hyderabad, India
      </p>
      <p>
        Web archiving
crawlers [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], especially, need the ability to detect change
in Web content and to infer when a Web page last changed.
This ability is fundamental in maintaining the coherence of
the crawl, in adjusting its refresh rate, in versioning, and
to allow a user to retrieve meaningful temporal data. The
understanding of the dynamics of Web pages, that is, how
fast the Web content changes and what the nature of these
changes is, its implications on the structure of the Web,
and the correlation with the topic of pages are also popular
subjects in the research literature [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In addition to being of paramount importance to Web
archiving, the subject of change detection is of interest in
various applications and domains, such as: large-scale
information monitoring and delivery systems [
        <xref ref-type="bibr" rid="ref15 ref17 ref18 ref23 ref24">18, 15, 24, 17, 23</xref>
        ] or
services1, Web cache improvement [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], version configuration
and management of Web archives [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], active databases [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
servicing of continuous queries [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Research has focused on
finding novel techniques for comparing snapshots of Web
pages (a reference Web page and its updated version) in
order to detect change and estimate its frequency. Change
can however be detected at various levels: there are various
aspects of dynamics which must be considered when studying
how Web content changes and evolves.
      </p>
      <p>
        The majority of works have seen the change detection
problem from a document-centric perspective, as opposed to
an object-centric one. By object or entity we mean here any
Web content, part of a Web page, that represents meaningful
information per se: image, news article, blog post, comment,
etc. Comparatively little effort has been put into distinguishing
relevant changes from those that might
occur because of the dynamic template of a Web page (ads,
active layout, etc.), i.e., its boilerplate [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>We study in this article some of the strategies that have
been established in different settings, with the aim of
providing an overview of the existing techniques used to derive
temporal properties of Web pages.</p>
      <p>
        There is a large body of work on the related problem
of change detection in XML documents, particularly for
purposes of data integration and update management in
XML-centric databases. However, the solutions developed for
XML documents cannot be applied without serious revisions
for HTML pages. The model assumptions made for XML
do not really hold for HTML. Indeed, the HTML and XML
formats have a key difference: while an XML page defines
the nature of the content by its meta tags, HTML tags define
mainly presentational aspects of the content within.
1Examples of these include http://timelyweb.en.
softonic.com/, http://www.urlywarning.net/, http://
www.changealarm.com/, http://www.changedetect.com/.
In addition
to the challenges that exist in XML documents, Web pages
add some others through their lack of formal semantics, fuzziness
regarding formatting, the embedding of multimedia
and scripts, etc. Separate approaches commonly need to be
adopted and the research done on XML documents is beyond
the scope of our paper, although we mention some works on
XML [
        <xref ref-type="bibr" rid="ref14 ref37">37, 14</xref>
        ] that have particular relevance to Web page
change detection. We refer the reader to [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for a survey of
XML change detection algorithms.
      </p>
      <p>We focus in this survey on deriving dynamics of HTML
documents.</p>
      <p>Change detection mechanisms can either be static,
estimating the date of last modification of content from the Web
page itself (its code, semantics or neighbors), or dynamic, by
comparing successive versions of a Web page. The structure
of this article reflects these dimensions. In Section 2, we
present static approaches to timestamping Web pages, while
Section 3 introduce dynamic methods. We then analyze in
Section 4 the diferent models of a Web page used by
existing techniques for comparing successive versions. Similarity
metrics used in dynamic methods are independently
investigated in Section 5. We briefly describe statistical modeling
approaches to estimate change frequency in Section 6. We
conclude with a discussion of some remaining open questions.
2.</p>
    </sec>
    <sec id="sec-3">
      <title>2. STATIC APPROACHES: TIMESTAMPING</title>
      <p>
        This section deals with methods for inferring temporal
properties of a Web page in a static manner as opposed to
the commonly used dynamic computation of the diference
between successive versions of a given Web page. The goal
here is to infer the creation or the last modification date of a
Web page or, possibly, of some parts of it. We study sources
of data that can be useful for that purpose: metadata, the
content of the Web page itself, or its graph neighborhood.
The canonical way for timestamping a Web page is to use
the Last-Modified HTTP header. Unfortunately, studies
have shown this approach is not reliable in general [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
We describe next why this happens in practice, as well as other
techniques for timestamping Web pages.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.1 HTTP metadata</title>
      <p>HTTP/1.1, the main protocol used by Web clients and
servers to exchange information, offers several features of
interest for timestamping, the foremost of which are the ETag
and Last-Modified HTTP response headers. Entity tags (or
ETags) are unique identifiers for a given version of a
particular document. They are supposed to change if and only if the
document itself changes. Servers can return this header with any
response, and clients can use the If-Match and If-None-Match
HTTP request headers to condition the retrieval of the
document on a change in the ETag value, thus avoiding
retrieving already known content. If-Modified-Since and
If-Unmodified-Since provide conditional downloading
features, in a similar way as for ETags. Even when conditional
downloading is not possible, ETags and HTTP timestamps
can be retrieved by a Web client without downloading the whole
Web page by making use of the HEAD HTTP method. The
problem is that while this information is generally provided
and is very reliable for static content (e.g., static HTML
pages or PDF), it is most of the time missing or changed at
every request (the timestamp given is that of the request,
not of the content change) when the content is dynamic
(generated by content management systems, etc.). Some CMSs
do return correct HTTP timestamps, such as MediaWiki2,
but they seem to be a minority.</p>
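      <p>As an illustration, a crawler might build such conditional requests with Python's standard library; a minimal sketch (the URL and header values are hypothetical):</p>
      <preformat>
```python
from urllib.request import Request

def conditional_request(url, cached_etag=None, cached_last_modified=None):
    # Conditional GET: the server replies "304 Not Modified" (no body)
    # if the cached version is still current.
    headers = {}
    if cached_etag:
        headers["If-None-Match"] = cached_etag
    if cached_last_modified:
        headers["If-Modified-Since"] = cached_last_modified
    return Request(url, headers=headers)

def head_request(url):
    # A HEAD request retrieves ETag/Last-Modified without the page body.
    return Request(url, method="HEAD")
```
      </preformat>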
      <p>
        In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Clausen presents an experimental study of the
reliability of ETags and HTTP timestamps on a collection
of a few million Danish Web pages. He finds that the
best strategy for avoiding useless downloads of versions of
Web pages already available in a Web archive is to always
download when the ETag header is missing, and otherwise
download only if the Last-Modified header indicates change.
In this experimental study, this rather counterintuitive
strategy yielded an almost perfect prediction of change, and a 63%
accuracy in predicting non-change. Given that the majority
of Web servers run some version of the open-source Apache3
HTTP server [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], it would be interesting to see whether this
strategy is correlated with some inherent behavior of this
software. Furthermore, repeating this experiment on a larger
scale and with a more recent set of Web pages would be of
interest.
      </p>
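      <p>Clausen's strategy can be sketched as a small decision function (our reading of the policy, not code from [12]; header values are treated as opaque strings):</p>
      <preformat>
```python
def should_download(etag, last_modified_prev, last_modified_now):
    # Always download when the ETag header is missing; otherwise
    # download only if the Last-Modified header indicates change.
    if etag is None:
        return True
    return last_modified_prev != last_modified_now
```
      </preformat>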
      <p>HTTP also provides the Cache-Control and Expires
response headers. This information is often given, but with a
zero or very low expiration delay, which means that nothing
interesting can be derived from it. In some specific and
controlled environments (e.g., Intranets), it might still be
useful to look at these two pieces of information to estimate
the refresh rate of a Web page.
</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Timestamps in Web content</title>
      <p>Content management systems, as well as Web authors,
often provide in the content of a Web page some
human-readable information about its last date of modification. This
can be a global timestamp (for instance, preceded by a “Last
modified” string, in the footer of a Web page) or a set of
timestamps for individual items in the page, such as news
stories, blog posts, comments, etc. In the latter case, the
global timestamp might be computed as the maximum of
the set of individual timestamps. It is actually quite easy to
extract and recognize such information, with keyword
selection (last, modification, date, etc.) or with entity recognizers
for dates (built out of simple regular expressions). However,
this timestamp is often quite informal and partial: there
is sometimes no time indication, and most of the time no
timezone. To the best of our knowledge, no formal study of
the precision reached by extracting timestamps from Web
content has been carried out.
</p>
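      <p>Such extraction can be sketched with keyword selection and a simple regular expression standing in for a date entity recognizer (the label and date patterns below are illustrative, not exhaustive):</p>
      <preformat>
```python
import re

# A hypothetical "Last modified"-style label followed by a date.
DATE = r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})"
LABELLED = re.compile(
    r"(?:last\s+(?:modified|updated)|modification\s+date)\s*:?\s*" + DATE,
    re.IGNORECASE)

def extract_timestamp(text):
    # Returns the first labelled date found in the page text, or None.
    m = LABELLED.search(text)
    return m.group(1) if m else None
```
      </preformat>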
    </sec>
    <sec id="sec-6">
      <title>2.3 Semantic temporal associations</title>
      <p>
        In addition to these timestamps provided to humans,
documents on the Web may include additional semantic
timestamps meant for machines. No mechanism for this exists in
HTML per se, but the HTML specification [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] allows for
arbitrary metadata in the form of &lt;meta&gt; tags, one particular
profile of such metadata being Dublin Core 4 whose modified
term indicates the date of last modification of a Web page.
Both content management systems and Web authors
occasionally use this possibility. Web feeds (in RSS or Atom
formats) also have semantic timestamps, which are quite
reliable, since they are essential to the working of applications
that exploit them, such as feed readers. In some cases,
external semantic content can be used for dating an HTML Web
      </p>
      <sec id="sec-6-1">
        <title>2http://www.mediawiki.org/</title>
      </sec>
      <sec id="sec-6-2">
        <title>3http://www.apache.org/</title>
      </sec>
      <sec id="sec-6-3">
        <title>4http://dublincore.org/documents/dcmi-terms/</title>
        <p>
          page: for instance, an RSS feed containing blog entries can
be mapped to the corresponding Web page, in order to date
individual items [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. Another case is that of sitemaps [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
Sitemaps are files that can be provided by the owner of a
Web site to describe its organization, so as to improve its
indexing by Web search engines. Sitemaps allow for both
timestamps and change rate indications (hourly, monthly,
etc.), but these features are not often used. Very few content
management systems produce all of this, although it would
be the ideal case: the download of a single file would
suffice to get all timestamping information about the whole
Web site.
        </p>
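          <p>Reading these fields is straightforward; a minimal sketch using Python's standard XML library (both &lt;lastmod&gt; and &lt;changefreq&gt; are optional in practice):</p>
          <preformat>
```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    # Map each URL to its declared last modification date and change
    # frequency hint (None when the optional element is absent).
    out = {}
    for url in ET.fromstring(xml_text).iter(NS + "url"):
        out[url.findtext(NS + "loc")] = (
            url.findtext(NS + "lastmod"),
            url.findtext(NS + "changefreq"),
        )
    return out
```
          </preformat>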
      </sec>
    </sec>
    <sec id="sec-7">
      <title>2.4 Using the neighborhood</title>
      <p>
        It is possible to use the graph structure of the Web to help
timestamping Web pages: [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] uses the neighboring pages
of a Web page to estimate its timestamp. When no source
of reliable timestamps is found for a given page using one
of the techniques described above, its timestamp is set to
some form of average of the timestamps of the pages it points
to and of the pages pointing to it. The inherent assumption is that pages
linked together tend to have similar update patterns. The
precision is not very high, but better than nothing when no
other information is available.
      </p>
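      <p>A minimal sketch of this heuristic, assuming timestamps are available as epoch seconds and taking the plain mean as the aggregate:</p>
      <preformat>
```python
def estimate_timestamp(neighbor_timestamps):
    # Fallback estimate for a page without reliable timestamps: some
    # form of average (here, the mean) of the timestamps of the pages
    # it links to and of the pages linking to it.
    if not neighbor_timestamps:
        return None
    return sum(neighbor_timestamps) / len(neighbor_timestamps)
```
      </preformat>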
    </sec>
    <sec id="sec-8">
      <title>3. DYNAMIC METHODS</title>
      <p>When static techniques do not give adequate results, there
is still the possibility of comparing a Web page with its
previous version in order to determine (sometimes in a rough
way) an equivalent of Last-Modified. The timestamp that
gives the last modification of a Web page can be inferred
then as the date when change has been detected.</p>
      <p>Nevertheless, for a precise estimation, a just-in-time crawl
of versions is needed. In reality, the frequency of change is
quite difficult to estimate because Web pages have different
patterns of change. In general, many factors determine
variations in the frequency of change for a given Web page:
the CMS, the subject, the time of the year, even the time of
the day, etc.</p>
      <p>
        Estimating the Web pages’ frequency of change is the
subject of many studies [
        <xref ref-type="bibr" rid="ref10 ref16 ref19 ref2 ref27">16, 2, 27, 10, 19</xref>
        ]. Their results
are however heavily dependent on the change detection
technique that they have used.
      </p>
      <p>There are two parameters that influence the process of
determining the dynamics:
1. the frequency of change is not known in advance,
but if we do not crawl the Web pages on time, we
miss versions and the timestamp detected will be then
imprecise;
2. the change detection technique, which heavily
relies on the similarity metrics and on the model (that
can capture more or less types of changes) and the
filtering of dynamic elements that influence the frequency
without being truly relevant (and which occur quite
often in Web pages because of AJAX applications or
advertisements).</p>
      <p>A method of filtering irrelevant content is to know what is
important rather than trying to filter what is unimportant.
If we knew in advance the frequency of crawl, then we would
know when change occurs and therefore set timestamps for
new versions. This is unfortunately not the case, but we
nevertheless need to crawl frequently enough (better too frequently
than not frequently enough). Once we have these versions,
we can detect if there is any change that happened in the
interval of time that represents the interval of crawl. Based
on this, timestamps can be derived with good approximation.</p>
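      <p>For instance, when a change is detected between two crawls, the last modification time is only known to lie within the crawl interval; a natural point estimate is the midpoint (a sketch, with times as epoch seconds):</p>
      <preformat>
```python
def infer_timestamp(prev_crawl, curr_crawl, change_detected):
    # The page changed somewhere in (prev_crawl, curr_crawl); take the
    # midpoint as the estimated last-modification timestamp.
    if not change_detected:
        return None
    return (prev_crawl + curr_crawl) / 2
```
      </preformat>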
      <p>The majority of works consider that they have access
to the versions and focus on detecting change efficiently.
Few make a semantic distinction between changes by
disregarding insignificant ones. Next, we present some general
notions about changes: what kind of changes can occur in
Web pages, which have been identified in our studied
approaches and which have not, and finally we give an insight
into how change is actually represented.
</p>
    </sec>
    <sec id="sec-9">
      <title>3.1 Types of changes</title>
      <p>
        We summarize here the types of changes detected by
different approaches. There are however other, more sophisticated
types that are sometimes mentioned, but not treated. For
instance, behavioral changes are mentioned in [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]; they occur
in active HTML elements like scripts, embedded applications
and multimedia. These new forms of Web content have a big
impact on the Web today, but they require a more complex
modeling.
      </p>
      <p>All considered approaches detect changes in content.</p>
      <p>
        Works that model the HTML page as a tree, including
the page digest encoding, usually also detect structural and
attribute changes. However, differences exist in the coverage
of cases for these types of changes. For example, MH-Diff [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
also detects move and copy structural changes, which is an
improvement over the traditional detection of insert, delete
and update. Structural (or layout) changes occur when the
position of elements in the page is modified. Attribute
(or presentation) changes are related to the representation
of information, for instance changes in the font, colors or
captions. For capturing structural and attribute changes,
the model has to be aware of their existence; this implies a
more complex model, which generally influences performance
in a negative manner. Unlike flat-file models of Web
pages, the output of content change detection in hierarchical
models is more meaningful: the type of node in which the
content change occurred can also be identified.
      </p>
      <p>
        There are also type changes: modifications that come
about when an HTML tag changes, e.g., a p tag which
becomes a div. Type changes can be detected by [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] which
uses the page digest encoding that provides a mechanism for
locating nodes of a particular type (see below).
      </p>
      <p>
        Semantic types of change capture the meaning of content
that has changed. They are defined in SCD [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], a pioneering
work in this direction.
      </p>
      <p>
        Changes are sometimes captured in a quantitative
manner rather than in a qualitative one. In contrast with the
qualitative way, where the change is described (in a delta
file) or visualized in a comparative manner, quantitative
approaches estimate the amount of change of a specific type.
More specifically, all approaches that use the similarity
formula defined in CMW [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] do not reconstruct the complete
sequence of changes, but give a numerical value of it. In this
case, supposing a threshold of change, we can determine whether a
page has changed or not, which actually amounts to an
oracle response, still useful in the majority of applications.
      </p>
    </sec>
    <sec id="sec-10">
      <title>3.2 The representation of change</title>
      <p>
        There are various ways to present the diference between
two documents. Changes are usually stored in a physical
structure generically called delta file or delta tree. The format
of storing the change for RMS [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] consists of a specialized
set of arrays that capture the relationships among the nodes
and the changes that occur both in structure and content.
Systems for monitoring change like [
        <xref ref-type="bibr" rid="ref18 ref24">24, 18</xref>
        ] have typically
a user interface and present changes in a graphical way.
HTMLdiff merges the input Web page versions into one
document that summarizes all the common parts and
also the changed ones. The advantage is that the common
parts are displayed just once, but on the other hand, the
resulting merged HTML can be syntactically or semantically
incorrect. Another choice linked to change presentation is
to display only the diferences and omit the common parts
of the two Web pages. When the documents have a lot of
data in common, presenting only the diferences could be
better, with the drawback that the context is missing. The
last approach is to present the diferences between the old
and new version side by side.
      </p>
      <p>
        These presentation modes are used in combination, rather
than being the unique choice for a given system. For
example, [
        <xref ref-type="bibr" rid="ref18 ref24">24, 18</xref>
        ] present the results of the change
monitoring service using a hybrid approach: presentation modes are
combined depending on the type of change that is presented.
      </p>
    </sec>
    <sec id="sec-11">
      <title>4. HTML DOCUMENT MODELS</title>
      <p>This section contains an overview of the models that are
considered in the quest of detecting changes in Web
documents. The modeling step is a key one as it will determine
the elements on which the comparison algorithms operate.</p>
      <p>We first discuss the “naïve” approach, that is, to consider
Web pages as flat files (strings); then we describe tree models,
a popular choice in the literature. We also explore some
approaches that are based on tree models, but which
essentially transform the two versions to be compared into
a bipartite graph on which specific algorithms are applied.
Finally, we present the Page Digest design of Web pages, a
manner of encoding that clearly separates structural elements
of Web documents from their content, while remaining highly
compact.
</p>
    </sec>
    <sec id="sec-12">
      <title>4.1 Flat-files</title>
      <p>
        Some early change detection systems model Web pages as
flat files [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. As these models take into account neither the
hierarchical structure of HTML documents nor the
characteristics of the layout, they can detect only content
changes – and this without making any semantic distinction
in the content.
      </p>
      <p>
        Some works [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] first try to filter irrelevant content by using
heuristics on the type of content and regular expressions.
After this basic filtering, the Web page content is directly
hashed and compared between versions. Unfortunately, we
can never filter all kinds of inconvenient content, especially
as its manner of encoding or type gets more complex.
      </p>
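      <p>The flat-file pipeline can be sketched as follows; the volatile patterns below are illustrative stand-ins, not the actual heuristics of [34]:</p>
      <preformat>
```python
import hashlib
import re

# Hypothetical patterns for volatile content (dates, hit counters).
VOLATILE = [
    re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    re.compile(r"\b\d+ visitors?\b"),
]

def fingerprint(text):
    # Strip volatile content, then hash what remains.
    for pattern in VOLATILE:
        text = pattern.sub("", text)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def changed(old_version, new_version):
    # Versions are compared via their hashes only.
    return fingerprint(old_version) != fingerprint(new_version)
```
      </preformat>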
    </sec>
    <sec id="sec-13">
      <title>4.2 Trees</title>
      <p>A natural approach is to represent a Web page as a tree
using the DOM model. By taking into account the
hierarchies, structural and attribute changes can be detected
besides content changes.</p>
      <p>Differences between tree models appear in their ordered
characteristics and in the level of granularity on which change
is detected: node, branch, subtree or “object”. We will further
discuss these aspects.</p>
      <p>
        First, the modeling of a Web page into a tree requires
a preprocessing step of cleaning. This is a significant step
because it corrects missing, mismatched, or out-of-order
end tags, as well as all other syntactic ill-formedness of the
HTML document. A tree is constructed first by filtering
the “tag soup” HTML into an XML document and second,
by manipulating the result in order to obtain a workable
tree structure using implementations of the DOM standard
(that usually employ XSLT and XPath). Sometimes also
an initial pruning of elements is done: [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] filters out
scripts, applets, embedded objects and comments.
      </p>
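      <p>A minimal sketch of such preprocessing using Python's standard html.parser module, which tolerates tag soup (though, unlike HTML Tidy, it does not repair mismatched tags); scripts, applets and embedded objects are pruned, and comments are dropped implicitly:</p>
      <preformat>
```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    # Nodes are (tag, children) pairs; text leaves are ("#text", data).
    PRUNE = {"script", "applet", "object"}

    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]
        self.in_pruned = False

    def handle_starttag(self, tag, attrs):
        if self.in_pruned:
            return
        if tag in self.PRUNE:
            self.in_pruned = True  # not nesting-safe; fine for a sketch
            return
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in self.PRUNE:
            self.in_pruned = False
        elif (not self.in_pruned and len(self.stack) > 1
              and self.stack[-1][0] == tag):
            self.stack.pop()

    def handle_data(self, data):
        # Comments never reach here; handle_comment is simply not defined.
        if not self.in_pruned and data.strip():
            self.stack[-1][1].append(("#text", data.strip()))

def parse_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root
```
      </preformat>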
      <p>
        Many works do not specify how they realize the cleaning,
so either they assume it to be done in advance, or they solve
this issue by using an HTML cleaning tool. HTML Tidy5 is
well-suited for this purpose and mentioned in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        There are however cases [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] when the tree model does not
need to be cleaned: as the semantics of tags (value, name) is
leveraged in the technique, the structure does not have to
be enforced.
      </p>
      <sec id="sec-13-1">
        <title>Ordered trees.</title>
        <p>The ordered characteristic of trees implies that the order
of appearance of nodes is considered in the algorithm and
therefore included in the model.</p>
        <p>
          The RMS algorithm [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] stores the level (depth) of a node,
information which will be used in the tree traversal and
parsing. The SCD [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] algorithm uses branches of ordered
trees to detect semantic changes. The notion of branch is
used to give the context of a node and is formalized as an
ordered multiset, in which the elements are designated by the
tag name of the HTML non-leaf node, or its content (text),
if it represents a leaf node. The order is very important in
this model because of the data hierarchy considered (e.g.,
book.author.name.Eminescu vs. a markup hierarchy like
div.p.b.PCDATA). In this model, if we change the order, we
change the semantics of hierarchies or this semantics becomes
incoherent.
        </p>
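        <p>Our reading of this branch notion can be sketched as follows, assuming a simple (tag, children) tree representation with ("#text", text) leaves:</p>
        <preformat>
```python
def branches(node, prefix=()):
    # Each root-to-leaf path becomes an ordered sequence of tag names
    # ending in the leaf text, e.g. ("book", "author", "name", "Eminescu").
    tag, payload = node
    if tag == "#text":
        return [prefix + (payload,)]
    result = []
    for child in payload:
        result.extend(branches(child, prefix + (tag,)))
    return result
```
        </preformat>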
        <p>
          The ordered tree model is also used in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
      </sec>
      <sec id="sec-13-2">
        <title>Unordered trees.</title>
        <p>The unordered labeled tree model does not consider the
order of appearance of elements in the tree as relevant, instead
only the parent-child relationships are captured.</p>
        <p>
          It is mentioned in MH-Diff [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that the change detection
problem for unordered trees is harder than for ordered ones.
Like [
          <xref ref-type="bibr" rid="ref17 ref9">17, 9</xref>
          ], [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] constructs a weighted bipartite graph from
the two trees given as input. In these models, the order has
no bearing on the final structure on which further
processing will be done; therefore this feature is not captured.
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] delimits and encodes subtrees; these are generated by
analyzing the level of a node in the unordered tree. From
the root, when we arrive at level % 3, a subtree is created;
the node at (level % 3 + 1) becomes the local root of the
following subtree, and so on iteratively until the end of the
hierarchy. Each subtree is marked with the tag name of
its local root and indexed based on its start and end node
identifiers. A hashtable will finally map metadata about
nodes (like tag name, content, path to the root, attribute-value
pairs, etc.).
        </p>
        <sec id="sec-13-2-1">
          <title>5http://tidy.sourceforge.net/</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>4.3 Bipartite graph</title>
      <p>
        The bipartite graph model is derived from
unordered trees: it consists of two independent sets (acquired
after tree pruning), each corresponding to a version, that
are connected by cost edges. As set elements, subtrees are
chosen over nodes in [
        <xref ref-type="bibr" rid="ref17 ref22">22, 17</xref>
        ]. The assumption [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is that
we might be more interested in knowing in which subtree the change
occurs than in which specific node.
      </p>
      <p>
        The cost of an edge represents the cost of the edit script
needed to make a model entity (node or subtree) of the first
set isomorphic to one of the second set. The similarities
between all subtrees in the first tree and all those of the
second tree are computed and placed in a cost matrix. Having
this matrix, the Hungarian [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] algorithm is used to find
in polynomial time a minimum-cost bijection between the
two partitions. This algorithm is typically used in linear
programming to find the optimal solution to the assignment
problem.
      </p>
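      <p>The assignment step can be illustrated on a small cost matrix; for clarity, this sketch brute-forces over permutations, whereas the Hungarian algorithm solves the same problem in polynomial time:</p>
      <preformat>
```python
from itertools import permutations

def min_cost_assignment(cost):
    # cost[i][j]: cost of matching subtree i of the old version to
    # subtree j of the new version. Returns the minimum-cost bijection.
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = list(perm), total
    return best_perm, best_cost
```
      </preformat>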
      <p>
        The CMW [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and MH-Diff [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] algorithms are based
on transforming the change detection problem into one of
computing a minimum-cost edge cover on a bipartite graph.
Optimizations are brought out in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] in comparison with
the work in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but the general aim and similarity metrics
remain the same as in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-15">
      <title>4.4 Page Digest</title>
      <p>
        The Page Digest model (see http://www.cc.gatech.edu/projects/disl/PageDigest/)
has been adopted in [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ],
and represents a more compact encoding of data than the HTML
and XML formats, while preserving their advantages. To
construct this model, several steps are performed: counting
the nodes, enumerating children in a depth-first manner (to
capture the structure of the document), and encoding and
mapping the content of each node, in a way that preserves
the natural order of the text in the Web page.
      </p>
      <p>
        SDiff [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] is a Web change monitoring application that
uses a digest format that also includes tag type and attribute
information. RMS [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] also uses this model, although without
giving many details. The advantages of the Page Digest over
the DOM tree model are enumerated in [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]; we mention
minimality and execution performance: the reduction of tag
redundancy yields a more compact model and therefore a faster
document traversal. The algorithms developed on this model
run in linear time without requiring many heuristics or
restrictions, while still capturing a large palette of changes.
      </p>
    </sec>
    <sec id="sec-16">
      <title>SIMILARITY METRICS</title>
      <p>Similarity metrics are used in dynamic methods of
change detection, in the stage that matches the two versions
modeled as described in Section 4.</p>
      <p>For string models, (content) change is identified when the
data strings are found to be partially dissimilar. For
tree models, which have various dimensions, the problem
becomes more complex.</p>
      <p>The aim is to identify model elements that are essentially
the same, but that have been affected by change. Model
elements that are identical are pruned, because they have
essentially not changed between versions; likewise, totally
dissimilar model elements do not represent instances of the
same evolving object, so they are not included in further
processing steps. Essentially, only approximately similar
model elements are studied further. A typical matching
is based on comparing attribute values of the model
elements: if they have the same properties, they are
considered similar.</p>
      <p>We next describe the types of similarity metrics used
for change detection in the different settings and applications
of the studied works.</p>
    </sec>
    <sec id="sec-17">
      <title>String matching techniques</title>
      <p>Approaches that compare two Web pages modeled as flat
files (i.e., strings) rely on hash-based methods, edit distance
metrics, or techniques based on the longest common
subsequence. We do not exhaustively cover all the possibilities,
but rather present those that we have encountered
in the analyzed works.</p>
      <sec id="sec-17-1">
        <title>Naïve: Jaccard.</title>
        <p>
          One simple technique for change detection in text is to
count the number of words that are common to two string
sequences (regardless of their position) and to divide the
result by the number of distinct words. Another possibility
is to divide by the length of the first string, as done in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
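        <p>The word-overlap measure just described can be sketched as follows (tokenization by whitespace is an illustrative choice):</p>

```python
def word_jaccard(a: str, b: str) -> float:
    """Number of words common to both strings (regardless of
    position) divided by the number of distinct words."""
    wa, wb = set(a.split()), set(b.split())
    if not (wa | wb):
        return 1.0  # two empty texts are identical
    return len(wa & wb) / len(wa | wb)

v1 = "breaking news about the web"
v2 = "breaking news about web archives"
similarity = word_jaccard(v1, v2)  # 4 shared words out of 6 distinct
```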
      </sec>
      <sec id="sec-17-2">
        <title>Hash-based methods.</title>
        <p>
          In this category, we mention [
          <xref ref-type="bibr" rid="ref20 ref3 ref8">8, 20, 3</xref>
          ]. The method of [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
uses shingling, which is a flexible method of detecting changes
between two strings. It is usually referred to as  -shingling,
where  the denotes the number of tokens in each shingle
in the set. A shingle is a contiguous subsequence ( -gram)
of the reference string. For the two strings to be compared,
their shingles are hashed; if these strings have a lot of shingle
values in common, then they are similar.
        </p>
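        <p>A minimal sketch of w-shingling, assuming whitespace tokenization and Python's built-in hash (production systems use stable fingerprints such as Rabin hashes):</p>

```python
def shingles(text: str, w: int = 3) -> set:
    """Hashed w-shingles: every contiguous run of w tokens."""
    tokens = text.split()
    return {hash(tuple(tokens[i:i + w]))
            for i in range(max(len(tokens) - w + 1, 1))}

def resemblance(a: str, b: str, w: int = 3) -> float:
    """Fraction of shingle values the two strings share."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```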
        <p>
          Another method is to use signatures. A signature of a
string is a function of its hashed value. When the model is
hierarchical, the signature of a node is computed based on its
terminal path. For a formatting (or non-leaf) node, the
signature is the sum of the signatures of its children, down to
the leaf nodes, where changes are actually detected. In [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], only
nodes whose signatures differ from those of the original
version are compared. Change detection algorithms that
employ signatures have the disadvantage that false negatives
are possible: a change exists, but no differing signature
reveals it. This clearly depends on a careful choice of the
hashing space and, ultimately, on whether the application
can tolerate false results.
        </p>
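        <p>The bottom-up signature scheme can be sketched on a toy tree encoding; the tuple representation and the 32-bit masking are our own illustrative assumptions:</p>

```python
def signature(node) -> int:
    """Signature of a node: leaves hash their text; a formatting
    (non-leaf) node sums the signatures of its children."""
    kind, payload = node  # ("text", string) or (tag, [children])
    if kind == "text":
        return hash(payload) & 0xFFFFFFFF
    return sum(signature(child) for child in payload) & 0xFFFFFFFF

v1 = ("div", [("text", "hello"), ("text", "world")])
v2 = ("div", [("text", "hello"), ("text", "world!")])
changed = signature(v1) != signature(v2)  # only then compare nodes
```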
        <p>
          Another hash-based approach is presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], where
change is detected at the atomic element level using a direct
hash.
        </p>
      </sec>
      <sec id="sec-17-3">
        <title>Longest common subsequence.</title>
        <p>
          A diff is a file comparison program that outputs the
differences between two files. Diff tools are based on computing the
longest common subsequence between two strings [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. For
instance, HTMLDiff uses the GNU diff utility, adapted for HTML
page change detection. This program treats Web pages as
strings and, after processing, highlights the changes directly
in a merged document; as mentioned in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], HTMLDiff
can consume significant memory and computation resources,
which might affect the scalability of the
tool. Tree models of Web pages also use diff techniques,
not at the HTML page level but at the subtree (or node) content
level, which has the advantage of making the computation
less complex. For instance, WebCQ [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] uses HTMLDiff to
detect changes at the object level. Here, an object is in
fact a part of a Web page specified for monitoring by the
user, either by means of regular expressions or by marking
elements of the HTML DOM tree such as a table, list, or link.
        </p>
        <p>
          Another example of a system that uses the GNU diff tool is
WebVigiL [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
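        <p>The longest-common-subsequence primitive that diff tools build on can be sketched with the classic dynamic program, here over word tokens rather than lines:</p>

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    by the standard O(len(a) * len(b)) dynamic program."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

old = "the cat sat on the mat".split()
new = "the cat lay on a mat".split()
unchanged = lcs_length(old, new)  # tokens surviving between versions
```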
      </sec>
      <sec id="sec-17-4">
        <title>Edit scripting on strings.</title>
        <p>The edit distance between two strings of characters
is the number of operations required to transform one
string into the other. There are different ways of defining
an edit distance, depending on which edit operations are
allowed: deletion, insertion, etc. In string edit scripting, the
atomic element is a single character and the cost is usually
unitary for every edit operation defined.</p>
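        <p>A minimal sketch of character-level edit distance with unit costs, i.e., the Levenshtein distance:</p>

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: unit-cost insertions, deletions,
    and substitutions of single characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[len(b)]
```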
      </sec>
      <sec id="sec-17-5">
        <title>Root Mean Square.</title>
        <p>
          Another notable way [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] of computing similarity is to use
the RMS (root mean square) value, i.e., the quadratic mean.
RMS is a statistical measure of the magnitude of
a varying quantity and therefore allows quantifying the
change. If its value is small, then the difference between
the compared elements is not significant. Often used in
engineering to estimate the similarity between a
canonical model and an empirical one (in order to assess the
precision of an experiment), the RMS formula needs numeric
values for the model parameters. In the HTML context, the
ASCII values of the Web document’s text characters are
used in the canonical formula. Although a good idea, the
RMS measure applied to the ASCII values of each character
has some drawbacks: first, it does not take into account
hierarchies (it considers that every character has equal
influence, independently of its position in the page), and
second, it cannot take into account the semantics of the context.
Variants of this measure are presented in [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ].
        </p>
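        <p>One plausible reading of this character-level measure is sketched below; the padding convention and the exact formula are our own assumptions, which the cited work may not share:</p>

```python
import math

def rms_difference(a: str, b: str) -> float:
    """Quadratic mean of the differences between the character
    codes of two texts, padded with spaces to equal length."""
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a.ljust(n), b.ljust(n)
    return math.sqrt(sum((ord(x) - ord(y)) ** 2
                         for x, y in zip(a, b)) / n)
```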
      </sec>
    </sec>
    <sec id="sec-18">
      <title>Matching of hierarchical models</title>
      <p>Edit scripting on trees has the same formal definition as
for strings, except that the basic entities are here tree elements
(nodes or subtrees). Also, edit operations can occur at different
levels, depending on the types of changes considered. As a
consequence, the cost models of edit operations become more
complex. Each edit operation has an associated cost (either
proportional to the complexity of the operation, or based on
heuristics of the model), so the execution of an edit script
as a sequence of operations returns a cumulated cost.
The similarity measure can then be computed as the inverse
of this total cost. The interpretation is the following: the fewer
modifications we make to the first data structure
in order to make it isomorphic with its other version, the
more similar the two structures are.</p>
      <p>
        The problem of determining the distance between two
trees is referred to as the tree-to-tree correction problem
and is covered in more depth in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Some works [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
report that edit scripting with move operations is
NP-hard. Usually, every model introduces some kind of modeling
or computational heuristic that makes the problem (a little)
less difficult. As an example, [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] computes an edit script
on branches. It can also be done on subtrees, rather than on
complete trees, for the same complexity reasons. Concerning
the cost of change operations, every technique makes its own
decisions. MH-Diff [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], for example, defines costs as being
moderated by constants that depend on the type of
change.
      </p>
      <sec id="sec-18-1">
        <title>Quantitative measures of change.</title>
        <p>
          Various works [
          <xref ref-type="bibr" rid="ref17 ref22 ref3">17, 3, 22</xref>
          ] use a composed measure of
similarity that tries to better adapt to the specific of the types
of changes considered. The change is measured by a formula
that incorporates three sub-measures of specific similarity:
on the content (intersect: the percentage of words that
appear in both textual content of subtrees), attributes (attdist:
the relative weight of the attributes that have the same value
in the model elements) and on the types of elements
considered in the path (typedist emphasizes diferences on the tag
names when going up in the hierarchy). The final measure
incorporates all above-defined types of similarity, together
with some parameters that are meant to influence the
importance of certain types of changes over others. The advantage
of this measure is that it captures the symbiosis of diferent
types of changes that occur in a certain way independently:
content changes in leaf nodes, attribute changes in internal
nodes; the third submeasure is more focused on the position
of nodes in the structure.
        </p>
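        <p>The structure of such a composed measure can be sketched as follows; the sub-formulas, parameter values, and names are illustrative assumptions, not those of the cited works:</p>

```python
def composed_similarity(words1, words2, attrs1, attrs2, path1, path2,
                        w_content=0.5, w_attr=0.3, w_type=0.2):
    """Weighted combination of content, attribute, and path-type
    sub-measures between two model elements."""
    union = set(words1) | set(words2)          # intersect: shared words
    intersect = len(set(words1) & set(words2)) / len(union) if union else 1.0
    keys = set(attrs1) | set(attrs2)           # attdist: equal-valued attrs
    attdist = (sum(attrs1.get(k) == attrs2.get(k) for k in keys) / len(keys)
               if keys else 1.0)
    pairs = list(zip(path1, path2))            # typedist: matching tag names
    typedist = (sum(t1 == t2 for t1, t2 in pairs) /
                max(len(path1), len(path2)) if pairs else 1.0)
    return w_content * intersect + w_attr * attdist + w_type * typedist
```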
        <p>
          Another quantitative measure of change is proposed in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
Here, a weighted measure that determines the magnitude of
the difference between two ordered multisets (i.e., branches)
is employed. In an ordered multiset, the weight of a
node is defined as 2<sup>−i</sup>, where i is the depth of the
element in the branch considered. Finally, the quantity of
change is measured by computing the sum of the weights of
the nodes that appear in the symmetric difference set.
        </p>
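        <p>A sketch of this weighted measure, under assumed conventions for representing a branch as a multiset of (label, depth) pairs:</p>

```python
from collections import Counter

def change_magnitude(branch1, branch2):
    """Sum of 2**(-depth) weights over the multiset symmetric
    difference of two branches of (label, depth) pairs."""
    c1, c2 = Counter(branch1), Counter(branch2)
    sym_diff = (c1 - c2) + (c2 - c1)  # elements in exactly one branch
    return sum(2.0 ** -depth * count
               for (_, depth), count in sym_diff.items())

old = [("html", 0), ("body", 1), ("p", 2)]
new = [("html", 0), ("body", 1), ("div", 2)]
magnitude = change_magnitude(old, new)  # deep changes weigh less
```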
      </sec>
    </sec>
    <sec id="sec-19">
      <title>STATISTICALLY ESTIMATING CHANGE</title>
    </sec>
    <sec id="sec-20">
      <title>Motivating estimative models</title>
      <p>We have seen the multitude of domains interested
in the change detection issue. The general aim is to get an
idea of the dynamics of a certain type of content, at a
given granularity.</p>
      <p>In Web crawler-related applications, the interest is more
in whether a Web page has changed or not, in order to know
whether a new version of the page should be downloaded.
Only a binary response is needed; this is not because it
is uninteresting to distinguish between
the different types of changes that can occur, but because
current crawlers treat information at the Web page level. In this
case, an estimation of the change frequency is as effective as
explicitly computing it, as shown in Section 3. Indeed, if
archive crawlers were more aware of the semantics of the data
they process, they could clearly benefit from a broader, richer
insight into the data and could develop different strategies
for storage and processing. An estimation of the
change rate, although not very descriptive (we usually do not
know where the change appeared or its type), is still useful,
especially if we can devise a strategy that combines
estimative and comparative methods of deriving dynamics.
For this reason, we briefly present some of the existing
statistical approaches.</p>
    </sec>
    <sec id="sec-21">
      <title>Poisson model</title>
      <p>
        The study carried out in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] reports that the changes
occurring in Web pages can be modeled as a Poisson process.
A Poisson process models a sequence of random
events that happen independently at a fixed rate over time.
Based on a(n ideally complete) change history, the frequency
of change is estimated. The time-independence assumption
of homogeneous models [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] does not really capture the
reality of the Web. The authors of [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] affirm that in the
case of dynamic Web content (such as blogs), posting rates
vary greatly depending on different parameters. Hence,
they propose an inhomogeneous Poisson model (which does not
assume that events happen independently); this model learns
the posting patterns of Web pages and predicts when to recheck for
new content.
      </p>
      <p>
        The work in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] formalizes several use cases where, by
adapting the parameters of the Poisson model to the requirements
of the application, better estimation accuracy can
be achieved. The situation where we do not have a complete
change history of a Web page, which is actually
the real-world case, is also treated. Sometimes we only have the last date of
change, or at best just know that a change has occurred.
The contributions also relate to the fact that the authors
adapt the canonical model to interesting applications and
feed different estimators, thus improving the technique for
the considered cases.
      </p>
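      <p>A naive sketch of frequency estimation from a polling history under the homogeneous Poisson model; the estimators derived in the cited works are less biased, so this only illustrates the idea:</p>

```python
import math

def estimated_change_rate(accesses: int, changes: int,
                          interval: float) -> float:
    """If a page is polled `accesses` times at a fixed `interval`
    and found changed `changes` times, solve for the rate lambda in
    P(change within one interval) = 1 - exp(-lambda * interval)."""
    p_change = changes / accesses
    if p_change >= 1.0:
        raise ValueError("changed at every access: rate is unbounded")
    return -math.log(1.0 - p_change) / interval

# Polled daily for 30 days, found changed on 10 visits.
rate = estimated_change_rate(30, 10, interval=1.0)  # changes per day
```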
    </sec>
    <sec id="sec-22">
      <title>Kalman filters</title>
      <p>
        Another statistical approach to change detection is that of
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Here, the textual vector space model is employed to
identify the patterns of a page and to train Kalman filters
on these patterns. A change is then the
event that does not match the prediction. The possible
disadvantages of this method are that it assumes the linearity
of the system (because of the Kalman filter model, which
performs exact inference in a linear dynamical system)
and that it uses an incomplete vector space model (for complexity
reasons).
      </p>
    </sec>
    <sec id="sec-23">
      <title>OPEN QUESTIONS</title>
      <p>We end this article by discussing several issues that have
not yet been addressed and that relate to deriving
temporal properties of Web pages. The selection of
subjects reflects our personal vision of the topic.</p>
      <sec id="sec-23-1">
        <title>Further studies on timestamping.</title>
        <p>With the large number of sources of timestamping hints,
it should indeed be possible to estimate the freshness of
a Web page. An experimental study on the reliability of
these sources, perhaps in specific contexts (a given Web
server software, a given CMS, etc.), which could provide
more insight into optimal strategies for timestamping Web
pages, is still to be carried out.</p>
      </sec>
      <sec id="sec-23-2">
        <title>Relevant change detection.</title>
        <p>
          An important question when detecting changes is whether
the changes are relevant to the interests or needs of the user
or archivist. Web users are exposed to many advertisements,
and content is increasingly pre-fabricated, put into containers, and
delivered in a targeted way, so much more attention should
be paid to the relevance factor. Additionally, Web pages have
components that are more dynamic than others; it is not
possible to say whether the dynamics come from irrelevant content
(think of ads changing at every request) or precisely from
very informative content (e.g., frequently updated
news). To distinguish between these, techniques
that define and extract semantics (e.g., by looking at the
topic and linked data) might be used as a filtering method,
or simply to add more significance to the change detection
results. The cleaning of a Web page, or its segmentation into
semantic blocks, is of interest in many information extraction
fields and, although we mention only [
          <xref ref-type="bibr" rid="ref21 ref29 ref40">40, 21, 29</xref>
          ], a large number
of works treat this subject.
        </p>
        <p>
          As an observation, there is a subtle difference in the
use of the term “meaningful” in the various works that we
have studied. While some works [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] use it to emphasize
the fact that more types of changes are detected, other
approaches [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] use it as a synonym for “relevant”, from the
content point of view. Vi-Diff [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] uses the VIPS algorithm [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]
to obtain from a Web page a hierarchy of semantic blocks and
detect changes only from this perspective, hopefully ignoring
all boilerplate content. However, a deeper look into the
relevance aspect is mentioned as future work; the authors
of [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] mention using machine learning techniques for this
issue, which would be an interesting line of research. Usually,
heuristics [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] are employed to obtain a measure of relevance,
because having a generic source of knowledge that can be
automatically used for this task is very difficult.
        </p>
        <p>
          Recently, [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] used the concepts that can be extracted from
the descriptions of Web feeds to obtain the content of interest
from Web pages. However, this kind of semantics can only be
obtained for Web pages that have associated feeds.
        </p>
      <p>With the emergence of the Semantic Web, we envision new
ways of distinguishing timestamps and of filtering irrelevant
content from Web pages, and therefore more efficient methods
for deriving the dynamics of Web pages.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          .
          <article-title>Issues in monitoring web data</article-title>
          .
          <source>In Proc. DEXA</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Adar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Elsas</surname>
          </string-name>
          .
          <article-title>The Web changes everything: Understanding the dynamics of Web content</article-title>
          .
          <source>In Proc. WSDM</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Artail</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Fawaz</surname>
          </string-name>
          .
          <article-title>A fast HTML Web change detection approach based on hashing and reducing the number of similarity computations</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>66</volume>
          (
          <issue>2</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Saint-Jean</surname>
          </string-name>
          .
          <article-title>Web dynamics, structure, and page quality</article-title>
          . In M. Levene and A. Poulovassilis, editors,
          <source>Web Dynamics</source>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Duncan</surname>
          </string-name>
          .
          <article-title>Tree-to-tree correction for document trees</article-title>
          .
          <source>Technical Report 95-372</source>
          , Queen's University, Kingston, Ontario, Canada,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Bertsekas</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Castañon</surname>
          </string-name>
          .
          <article-title>Parallel asynchronous Hungarian methods for the assignment problem</article-title>
          .
          <source>INFORMS J. Computing</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bogen</surname>
          </string-name>
          , II,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Karadkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Furuta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Shipman</surname>
          </string-name>
          , III.
          <article-title>Application of Kalman filters to identify unexpected change in blogs</article-title>
          .
          <source>In Proc. JCDL</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          .
          <article-title>On the resemblance and containment of documents</article-title>
          .
          <source>In Proc. SEQUENCES</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawathe</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Meaningful change detection in structured data</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>The evolution of the Web and implications for an incremental crawler</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Estimating frequency of change</article-title>
          .
          <source>ACM TOIT</source>
          ,
          <volume>3</volume>
          (
          <issue>3</issue>
          ),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Clausen</surname>
          </string-name>
          .
          <article-title>Concerning Etags and datestamps</article-title>
          .
          <source>In Proc. IWAW</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cobéna</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Abdessalem</surname>
          </string-name>
          .
          <article-title>A comparative study of XML change detection algorithms</article-title>
          .
          <source>In Service and Business Computing Solutions with XML. IGI Global</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cobéna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Marian</surname>
          </string-name>
          .
          <article-title>Detecting changes in XML documents</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Douglis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Koutsofios</surname>
          </string-name>
          .
          <article-title>The AT&amp;T Internet difference engine: Tracking and viewing changes on the Web</article-title>
          .
          <source>World Wide Web</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fetterly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Manasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>A large-scale study of the evolution of Web pages</article-title>
          .
          <source>In Proc. WWW</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Flesca</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Masciari</surname>
          </string-name>
          .
          <article-title>Efficient and effective Web page change detection</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>46</volume>
          (
          <issue>2</issue>
          ),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pandrangi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakravarthy</surname>
          </string-name>
          .
          <article-title>WebVigiL: An approach to just-in-time information propagation in large network-centric environments</article-title>
          . In M. Levene and A. Poulovassilis, editors,
          <source>Web Dynamics</source>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kawai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          .
          <article-title>Detecting age of page content</article-title>
          .
          <source>In Proc. WIDM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Khandagale</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Halkarnikar</surname>
          </string-name>
          .
          <article-title>A novel approach for Web page change detection system</article-title>
          .
          <source>Intl. J. Comput. Theory Eng.</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kohlschütter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fankhauser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <article-title>Boilerplate detection using shallow text features</article-title>
          .
          <source>In Proc. WSDM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Khoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>El-Mawas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>El-Rawas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Mounayar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Artail</surname>
          </string-name>
          .
          <article-title>An efficient Web page change detection system based on an optimized Hungarian algorithm</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <volume>19</volume>
          (
          <issue>5</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Lim</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>An automated change-detection algorithm for HTML documents based on semantic hierarchies</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>WebCQ: Detecting and delivering information changes on the Web</article-title>
          .
          <source>In Proc. CIKM</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Masanès</surname>
          </string-name>
          , editor.
          <source>Web Archiving</source>
          . Springer-Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Netcraft</surname>
          </string-name>
          .
          <article-title>January 2011 Web server survey</article-title>
          . http://news.netcraft.com/archives/2011/01/12/january-2011-web-server-survey-4.html,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ntoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          .
          <article-title>What's new on the Web? The evolution of the Web from a search engine perspective</article-title>
          .
          <source>In Proc. WWW</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>David</surname>
          </string-name>
          .
          <article-title>Using neighbors to date Web documents</article-title>
          .
          <source>In Proc. WIDM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oita</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          .
          <article-title>Archiving data objects using Web feeds</article-title>
          .
          <source>In Proc. IWAW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pehlivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben Saad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gançarski</surname>
          </string-name>
          .
          <article-title>A novel Web archiving approach based on visual pages analysis</article-title>
          .
          <source>In Proc. IWAW</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rocco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buttler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Page digest for large-scale Web services</article-title>
          .
          <source>In Proc. CEC</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Sahinalp</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Utis</surname>
          </string-name>
          .
          <article-title>Hardness of string similarity search and other indexing problems</article-title>
          .
          <source>In Proc. ICALP</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Sia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.-K.</given-names>
            <surname>Cho</surname>
          </string-name>
          .
          <article-title>Efficient monitoring algorithm for fast news alerts</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <volume>19</volume>
          (
          <issue>7</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sigurðsson</surname>
          </string-name>
          .
          <article-title>Incremental crawling with Heritrix</article-title>
          .
          <source>In Proc. IWAW</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35] sitemaps.org.
          <article-title>Sitemaps XML format</article-title>
          . http://www.sitemaps.org/protocol.php,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36] W3C.
          <source>HTML 4.01 specification</source>
          . http://www.w3.org/TR/REC-html40/,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          .
          <article-title>X-Diff: An effective change detection algorithm for XML documents</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>Change detection in Web pages</article-title>
          .
          <source>In Proc. ICIT</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>Parallel crawler architecture and Web page change detection</article-title>
          .
          <source>WSEAS Transactions on Computers</source>
          ,
          <volume>7</volume>
          (
          <issue>7</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Improving pseudo-relevance feedback in Web information retrieval using Web page segmentation</article-title>
          .
          <source>In Proc. WWW</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>