<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Deriving Dynamics of Web Pages: A Survey *</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marilena</forename><surname>Oita</surname></persName>
							<email>marilena.oita@telecom-paristech.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">INRIA Saclay -Île-de-France</orgName>
								<orgName type="institution" key="instit2">Télécom ParisTech</orgName>
								<orgName type="institution" key="instit3">CNRS LTCI Paris</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierre</forename><surname>Senellart</surname></persName>
							<email>pierre.senellart@telecom-paristech.fr</email>
							<affiliation key="aff1">
								<orgName type="department">Institut Télécom Télécom ParisTech</orgName>
								<orgName type="institution">CNRS LTCI Paris</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Deriving Dynamics of Web Pages: A Survey *</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4F65894E71F8227B33D6F2115C0A3F39</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T19:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3.5 [Information Storage and Retrieval]: Online Information Services-Web-based services</term>
					<term>Algorithms</term>
					<term>Experimentation</term>
					<term>Measurement</term>
					<term>Change monitoring</term>
					<term>Web archiving</term>
					<term>Timestamping</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up-to-date with the current version of a Web page, all the more so since not all apparent changes are significant ones. We review major approaches to change detection in Web pages and extraction of temporal properties (especially, timestamps) of Web pages. We focus our attention on techniques and systems that have been proposed in the last ten years and we analyze them to get some insight into the practical solutions and best practices available. We aim at providing an analytical view of the range of methods that can be used, distinguishing them on several dimensions, especially, their static or dynamic nature, the modeling of Web pages, or, for dynamic methods relying on comparison of successive versions of a page, the similarity metrics used. We advocate for more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The World Wide Web challenges our capacity to develop tools that can keep track of the huge amount of information that is being modified at a rapid pace. Web archiving crawlers <ref type="bibr" target="#b25">[25]</ref>, especially, need the ability to detect change in Web content and to infer when a Web page last changed. This ability is fundamental to maintaining the coherence of the crawl, to adjusting its refresh rate, to versioning, and to allowing a user to retrieve meaningful temporal data. The understanding of the dynamics of Web pages, that is, how fast Web content changes and what the nature of these changes is, its implications on the structure of the Web, and the correlation with the topic of pages are also popular subjects in the research literature <ref type="bibr" target="#b4">[4]</ref>.</p><p>In addition to being of paramount importance to Web archiving, change detection is of interest in various applications and domains, such as: large-scale information monitoring and delivery systems <ref type="bibr" target="#b18">[18,</ref><ref type="bibr" target="#b15">15,</ref><ref type="bibr" target="#b24">24,</ref><ref type="bibr" target="#b17">17,</ref><ref type="bibr" target="#b23">23]</ref> or services <ref type="foot" target="#foot_0">1</ref>, Web cache improvement <ref type="bibr" target="#b11">[11]</ref>, version configuration and management of Web archives <ref type="bibr" target="#b29">[30]</ref>, active databases <ref type="bibr" target="#b18">[18]</ref>, and the servicing of continuous queries <ref type="bibr" target="#b1">[1]</ref>. Research has focused on finding novel techniques for comparing snapshots of Web pages (a reference Web page and its updated version) in order to detect change and estimate its frequency. 
Change can, however, be detected at various levels: several aspects of dynamics must be considered when studying how Web content changes and evolves.</p><p>The majority of works have approached the change detection problem from a document-centric perspective, as opposed to an object-centric one. By object or entity we mean here any Web content, part of a Web page, that represents meaningful information per se: an image, a news article, a blog post, a comment, etc. Comparatively little effort has been put into distinguishing relevant changes from those that might occur because of the dynamic template of a Web page (ads, active layout, etc.), i.e., its boilerplate <ref type="bibr" target="#b21">[21]</ref>.</p><p>We study in this article some of the strategies that have been established in different settings, with the aim of providing an overview of the existing techniques used to derive temporal properties of Web pages.</p><p>There is a large body of work on the related problem of change detection in XML documents, particularly for purposes of data integration and update management in XML-centric databases. However, the solutions developed for XML documents cannot be applied to HTML pages without serious revisions. The model assumptions made for XML do not really hold for HTML. Indeed, the HTML and XML formats have a key difference: while an XML document defines the nature of its content through its tags, HTML tags mainly define presentational aspects of the content within. In addition to the challenges that exist for XML documents, Web pages add others of their own: a lack of formal semantics, fuzziness in formatting, the embedding of multimedia and scripts, etc. 
Separate approaches commonly need to be adopted, and the research done on XML documents is beyond the scope of this paper, although we mention some works on XML <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b14">14]</ref> that have particular relevance to Web page change detection. We refer the reader to <ref type="bibr" target="#b13">[13]</ref> for a survey of XML change detection algorithms.</p><p>We focus in this survey on deriving dynamics of HTML documents.</p><p>Change detection mechanisms can either be static, estimating the date of last modification of content from the Web page itself (its code, semantics or neighbors), or dynamic, comparing successive versions of a Web page. The structure of this article reflects these dimensions. In Section 2, we present static approaches to timestamping Web pages, while Section 3 introduces dynamic methods. We then analyze in Section 4 the different models of a Web page used by existing techniques for comparing successive versions. Similarity metrics used in dynamic methods are independently investigated in Section 5. We briefly describe statistical modeling approaches to estimate change frequency in Section 6. We conclude with a discussion of some remaining open questions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">STATIC APPROACHES: TIMESTAMPING</head><p>This section deals with methods for inferring temporal properties of a Web page in a static manner as opposed to the commonly used dynamic computation of the difference between successive versions of a given Web page. The goal here is to infer the creation or the last modification date of a Web page or, possibly, of some parts of it. We study sources of data that can be useful for that purpose: metadata, the content of the Web page itself, or its graph neighborhood. The canonical way for timestamping a Web page is to use the Last-Modified HTTP header. Unfortunately, studies have shown this approach is not reliable in general <ref type="bibr" target="#b12">[12]</ref>. We describe next why this happens in practice and other techniques for timestamping Web pages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">HTTP metadata</head><p>HTTP/1.1, the main protocol used by Web clients and servers to exchange information, offers several features of interest for timestamping, the foremost of which are the ETag and Last-Modified HTTP response headers. Entity tags (or ETags) are unique identifiers for a given version of a particular document. They are supposed to change if and only if the document itself changes. Servers can return this with any response, and clients can use the If-Match and If-None-Match HTTP request headers to condition the retrieval of the document on a change in the ETag value, thereby avoiding the retrieval of already-known content. If-Modified-Since and If-Unmodified-Since provide conditional downloading features, similarly to ETags. Even when conditional downloading is not possible, ETags and HTTP timestamps can be retrieved by a Web client without downloading a whole Web page by making use of the HEAD HTTP method. The problem is that while this information is generally provided and is very reliable for static content (e.g., static HTML pages or PDF), it is most of the time missing or changed at every request (the timestamp given is that of the request, not of the content change) when the content is dynamic (generated by content management systems, etc.). Some CMSs do return correct HTTP timestamps, such as MediaWiki<ref type="foot" target="#foot_1">2</ref>, but they seem to be a minority.</p><p>In <ref type="bibr" target="#b12">[12]</ref>, Clausen presents an experimental study of the reliability of ETags and HTTP timestamps on a collection of a few million Danish Web pages. He finds that the best strategy for avoiding useless downloads of versions of Web pages already available in a Web archive is to always download when the ETag header is missing, and otherwise download only if the Last-Modified header indicates change. 
In this experimental study, this rather counterintuitive strategy yielded an almost perfect prediction of change, and a 63% accuracy in predicting non-change. Given that the majority of Web servers run some version of the open-source Apache<ref type="foot" target="#foot_2">3</ref> HTTP server <ref type="bibr">[26]</ref>, it would be interesting to see whether this strategy is correlated with some inherent behavior of this software. Furthermore, repeating this experiment on a larger scale and with a more recent set of Web pages would be of interest.</p><p>HTTP also provides the Cache-Control and Expires response headers. This information is often given, but with a zero or very low expiration delay, which means that nothing interesting can be derived from it. In some specific and controlled environments (e.g., Intranets), it might still be useful to look at these two pieces of information to estimate the refresh rate of a Web page.</p></div>
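Clausen's strategy can be stated as a small decision function. The following is an illustrative sketch (the function name and header-dictionary interface are our own), assuming the response headers of the archived version and of a fresh HEAD request are available:

```python
# Sketch of the download strategy reported in [12]: always re-download
# when the ETag header is missing from the fresh response; otherwise,
# download only if the Last-Modified header indicates a change.
def should_download(old_headers: dict, new_headers: dict) -> bool:
    """Decide whether to re-fetch a page, given the archived version's
    headers and those of a fresh HEAD request on the same URL."""
    if "ETag" not in new_headers:
        return True  # no ETag: be safe and download
    return old_headers.get("Last-Modified") != new_headers.get("Last-Modified")
```

For instance, a response carrying an ETag and an unchanged Last-Modified value would be skipped, even if the ETag value itself differs.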
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Timestamps in Web content</head><p>Content management systems, as well as Web authors, often provide in the content of a Web page some human-readable information about its last date of modification. This can be a global timestamp (for instance, preceded by a "Last modified" string, in the footer of a Web page) or a set of timestamps for individual items in the page, such as news stories, blog posts, comments, etc. In the latter case, the global timestamp might be computed as the maximum of the set of individual timestamps. It is actually quite easy to extract and recognize such information, with keyword selection (last, modification, date, etc.) or with entity recognizers for dates (built out of simple regular expressions). However, this timestamp is often quite informal and partial: there is sometimes no time indication, and most of the time no timezone. To the best of our knowledge, no formal study of the precision reached by extracting timestamps from Web content has been carried out.</p></div>
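The keyword-plus-regular-expression extraction described above can be sketched as follows; the keyword list and the two date patterns are illustrative assumptions, not an exhaustive recognizer:

```python
import re

# Minimal sketch of keyword-based timestamp extraction from page text:
# a "last modified/updated" trigger followed by an ISO or slash date.
DATE_RE = re.compile(
    r"(?:last\s+(?:modified|updated)[:\s]*)"       # keyword trigger
    r"(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})",  # ISO or slash date
    re.IGNORECASE,
)

def extract_timestamps(text: str) -> list:
    """Return all date strings preceded by a 'last modified' keyword."""
    return DATE_RE.findall(text)
```

As the section notes, such timestamps are informal: the pattern above captures no time-of-day or timezone, and a production recognizer would need many more date formats.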
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Semantic temporal associations</head><p>In addition to these timestamps provided to humans, documents on the Web may include additional semantic timestamps meant for machines. No mechanism for this exists in HTML per se, but the HTML specification <ref type="bibr" target="#b35">[36]</ref> allows for arbitrary metadata in the form of &lt;meta&gt; tags, one particular profile of such metadata being Dublin Core<ref type="foot" target="#foot_3">4</ref>, whose modified term indicates the date of last modification of a Web page. Both content management systems and Web authors occasionally use this possibility. Web feeds (in RSS or Atom formats) also have semantic timestamps, which are quite reliable, since they are essential to the working of applications that exploit them, such as feed readers. In some cases, external semantic content can be used for dating an HTML Web page: for instance, an RSS feed containing blog entries can be mapped to the corresponding Web page, in order to date individual items <ref type="bibr" target="#b28">[29]</ref>. Another case is that of sitemaps <ref type="bibr" target="#b34">[35]</ref>. Sitemaps are files that can be provided by the owner of a Web site to describe its organization, so as to improve its indexing by Web search engines. Sitemaps allow for both timestamps and change rate indications (hourly, monthly, etc.), but these features are not often used. Very few content management systems produce all of this, although it would be the ideal case: the download of a single file would suffice to get all timestamping information about the whole Web site.</p></div>
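As an illustration of how much timestamping information a single sitemap file can carry, the sketch below extracts the lastmod and changefreq hints from a (made-up) sitemap snippet, using only the standard library:

```python
import xml.etree.ElementTree as ET

# Made-up sitemap snippet, following the sitemaps.org protocol [35].
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2010-06-04</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>"""

def sitemap_hints(xml_text):
    """Map each URL in a sitemap to its (lastmod, changefreq) hints."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    hints = {}
    for url in root.findall("sm:url", ns):
        loc = url.findtext("sm:loc", namespaces=ns)
        hints[loc] = (
            url.findtext("sm:lastmod", namespaces=ns),
            url.findtext("sm:changefreq", namespaces=ns),
        )
    return hints
```

A crawler could use the lastmod value as a timestamp and the changefreq value to schedule its revisits, when sites actually populate these optional fields.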
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Using the neighborhood</head><p>It is possible to use the graph structure of the Web to help timestamp Web pages: <ref type="bibr" target="#b27">[28]</ref> uses the neighboring pages of a Web page to estimate its timestamp. When no source of reliable timestamps is found for a given page using one of the techniques described above, its timestamp is set to some form of average of the timestamps of the pages this page points to and of those pointing to it. The inherent assumption is that pages linked together tend to have similar update patterns. The precision is not very high, but better than nothing when no other information is available.</p></div>
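The "some form of average" leaves room for several choices; the sketch below uses the median of the neighbors' timestamps as one concrete, outlier-resistant option (this specific choice is our own, not necessarily that of [28]):

```python
# Hypothetical sketch of neighborhood-based timestamping: estimate a
# page's last-modification time (as a Unix epoch) from the known
# timestamps of its in- and out-neighbors, using the median.
def estimate_timestamp(neighbor_timestamps):
    """Return the median of the neighbors' timestamps, or None if the
    page has no neighbor with a known timestamp."""
    if not neighbor_timestamps:
        return None
    ts = sorted(neighbor_timestamps)
    mid = len(ts) // 2
    if len(ts) % 2 == 1:
        return ts[mid]
    return (ts[mid - 1] + ts[mid]) / 2
```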
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">DYNAMIC METHODS</head><p>When static techniques do not give adequate results, there is still the possibility of comparing a Web page with its previous version in order to determine (sometimes in a rough way) an equivalent of Last-Modified. The timestamp that gives the last modification of a Web page can then be inferred as the date when change has been detected.</p><p>Nevertheless, for a precise estimation, a just-in-time crawl of versions is needed. In reality, the frequency of change is quite difficult to estimate because Web pages have different patterns of change. In general, many factors determine variations in the frequency of change for a given Web page: the CMS, the subject, the time of the year, even the time of the day, etc.</p><p>Estimating the frequency of change of Web pages is the subject of many studies <ref type="bibr" target="#b16">[16,</ref><ref type="bibr" target="#b2">2,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b10">10,</ref><ref type="bibr" target="#b19">19]</ref>. Their results are, however, heavily dependent on the change detection technique used.</p><p>There are two parameters that influence the process of determining the dynamics:</p><p>1. the frequency of change is not known in advance, but if we do not crawl the Web pages on time, we miss versions and the timestamp detected will then be imprecise;</p><p>2. the change detection technique itself, which relies heavily on the similarity metrics, on the model (which can capture more or fewer types of changes), and on the filtering of dynamic elements that influence the apparent frequency without being truly relevant (and which occur quite often in Web pages because of AJAX applications or advertisements).</p><p>One method of filtering irrelevant content is to identify what is important, rather than trying to filter out what is unimportant. 
If we knew the frequency of change in advance, we would know when changes occur and could therefore set timestamps for new versions. This is unfortunately not the case, so we need to crawl frequently enough (better too frequently than not frequently enough). Once we have these versions, we can detect whether any change happened during the interval between crawls. Based on this, timestamps can be derived with good approximation.</p><p>The majority of works assume that they have access to the versions, and focus on detecting change efficiently. Few make a semantic distinction between changes by disregarding insignificant ones. Next, we present some general notions about changes: what kinds of changes can occur in Web pages, which of them have been identified in the approaches we study and which have not, and finally how change is actually represented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Types of changes</head><p>We summarize here the types of changes detected by different approaches. There are however other, more sophisticated types that are sometimes mentioned, but not treated. For instance, behavioral changes are mentioned in <ref type="bibr" target="#b37">[38]</ref>; they occur in active HTML elements like scripts, embedded applications and multimedia. These new forms of Web content have a big impact on the Web today, but they require a more complex modeling.</p><p>All considered approaches detect changes in content.</p><p>Works that model the HTML page as a tree, including page digest encoding, usually also detect structural and attribute changes. However, differences exist in the coverage of cases for these types of changes. For example, MH-Diff <ref type="bibr" target="#b9">[9]</ref> also detects move and copy structural changes, which is an improvement over the traditional detection of insert, delete and update. Structural (or layout) changes occur when the position of elements in the page is modified. Attribute (or presentation) changes are related to the representation of information, for instance changes in the font, colors or captions. For capturing structural and attribute changes, the model has to be aware of their existence; this implies a more complex model, which generally has a negative impact on performance. Unlike flat-file models of Web pages, the output of content change detection in hierarchical models is more meaningful: the type of node in which the content change occurred can also be identified.</p><p>There are also type changes: modifications that come about when HTML tags change, e.g., a p tag that becomes a div. 
Type changes can be detected by <ref type="bibr" target="#b30">[31]</ref>, which uses the page digest encoding that provides a mechanism for locating nodes of a particular type (see below).</p><p>Semantic types of change capture the meaning of the content that has changed. They are defined in SCD <ref type="bibr" target="#b23">[23]</ref>, a pioneering work in this direction.</p><p>Changes are sometimes captured in a quantitative manner rather than a qualitative one. In contrast with the qualitative way, where the change is described (in a delta file) or visualized comparatively, quantitative approaches estimate the amount of change of a specific type. More specifically, all approaches that use the similarity formula defined in CMW <ref type="bibr" target="#b17">[17]</ref> do not reconstruct the complete sequence of changes, but give a numerical value for it. In this case, given a threshold of change, we can determine whether a page has changed or not; this amounts to an oracle-style answer, which is still useful in the majority of applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">The representation of change</head><p>There are various ways to present the difference between two documents. Changes are usually stored in a physical structure generically called a delta file or delta tree. The format for storing the change in RMS <ref type="bibr" target="#b37">[38]</ref> consists of a specialized set of arrays that capture the relationships among the nodes and the changes that occur both in structure and content. Systems for monitoring change like <ref type="bibr" target="#b24">[24,</ref><ref type="bibr" target="#b18">18]</ref> typically have a user interface and present changes in a graphical way. HTMLdiff merges the input Web page versions into one document that summarizes all the common parts as well as the changed ones. The advantage is that the common parts are displayed just once, but on the other hand, the resulting merged HTML can be syntactically or semantically incorrect. Another choice linked to change presentation is to display only the differences and omit the common parts of the two Web pages. When the documents have a lot of data in common, presenting only the differences can be better, with the drawback that the context is missing. The last approach is to present the differences between the old and new version side by side.</p><p>These presentation modes are used in combination, rather than being the unique choice for a given system. For example, <ref type="bibr" target="#b24">[24,</ref><ref type="bibr" target="#b18">18]</ref> present the results of the change monitoring service using a hybrid approach: presentation modes are combined depending on the type of change that is presented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">HTML DOCUMENT MODELS</head><p>This section contains an overview of the models that are considered in the task of detecting changes in Web documents. The modeling step is a key one, as it determines the elements on which the comparison algorithms operate.</p><p>We first discuss the "naïve" approach, that is, to consider Web pages as flat files (strings); then we describe tree models, a popular choice in the literature. We also explore some approaches that are based on tree models but essentially transform the two versions to be compared into a bipartite graph, on which specific algorithms are applied. Finally, we present the Page Digest design of Web pages, a manner of encoding that clearly separates structural elements of Web documents from their content, while remaining highly compact.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Flat-files</head><p>Some early change detection systems model Web pages as flat files <ref type="bibr" target="#b15">[15]</ref>. As these models take into account neither the hierarchical structure of HTML documents nor the characteristics of the layout, they can detect only content changes, and this without making any semantic distinctions within the content.</p><p>Some works <ref type="bibr" target="#b33">[34]</ref> first try to filter irrelevant content by using heuristics on the type of content and regular expressions. After this basic filtering, the Web page content is directly hashed and compared between versions. Unfortunately, we can never filter all kinds of irrelevant content, especially as its encoding or type becomes more complex.</p></div>
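The filter-then-hash scheme can be sketched in a few lines; the "irrelevant content" pattern here (a visitor counter and ad markers) is a made-up stand-in for the heuristics of [34]:

```python
import hashlib
import re

# Minimal sketch of flat-file change detection: strip content matching
# illustrative "irrelevant" patterns, then hash and compare digests.
IRRELEVANT = re.compile(r"(?:visitor counter: \d+|ad-[0-9a-f]+)", re.IGNORECASE)

def page_hash(html: str) -> str:
    """Hash the page text after removing known-irrelevant fragments."""
    cleaned = IRRELEVANT.sub("", html)
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()

def has_changed(old_html: str, new_html: str) -> bool:
    return page_hash(old_html) != page_hash(new_html)
```

This illustrates both the appeal (a single digest comparison per page) and the limitation noted above: any dynamic fragment the filter does not anticipate produces a spurious change.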
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Trees</head><p>A natural approach is to represent a Web page as a tree using the DOM model. By taking the hierarchies into account, structural and attribute changes can be detected besides content changes.</p><p>Differences between tree models appear in their ordered characteristics and in the level of granularity at which change is detected: node, branch, subtree or "object". We will further discuss these aspects.</p><p>First, the modeling of a Web page as a tree requires a preprocessing step of cleaning. This is a significant step because it corrects missing, mismatched, or out-of-order end tags, as well as all other syntactic ill-formedness of the HTML document. A tree is constructed first by filtering the "tag soup" HTML into an XML document and second, by manipulating the result to obtain a workable tree structure using implementations of the DOM standard (which usually employ XSLT and XPath). Sometimes an initial pruning of elements is also done: <ref type="bibr" target="#b22">[22]</ref> filters out scripts, applets, embedded objects and comments.</p><p>Many works do not specify how they perform the cleaning; either they assume it to be done in advance, or they solve this issue by using an HTML cleaning tool. HTML Tidy<ref type="foot" target="#foot_4">5</ref> is well-suited for this purpose and mentioned in <ref type="bibr" target="#b3">[3]</ref>.</p><p>There are however cases <ref type="bibr" target="#b23">[23]</ref> in which the tree model does not need to be cleaned: as the semantics of tags (value, name) is leveraged by the technique, the structure does not have to be enforced.</p></div>
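A toy version of the cleaning-and-modeling step can be built with Python's forgiving standard-library parser; real systems use dedicated tools such as HTML Tidy, and the nested-tuple tree below is a simplification of a DOM:

```python
from html.parser import HTMLParser

# Sketch: build a simple (tag, [children]) tree from "tag soup", pruning
# scripts and similar elements as in [22]. Text leaves are ("#text", str).
class TreeBuilder(HTMLParser):
    PRUNE = {"script", "style", "applet", "object"}

    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]
        self.skip = 0  # depth inside pruned elements

    def handle_starttag(self, tag, attrs):
        if tag in self.PRUNE:
            self.skip += 1
        elif self.skip == 0:
            node = (tag, [])
            self.stack[-1][1].append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in self.PRUNE:
            self.skip = max(0, self.skip - 1)
        elif self.skip == 0 and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.skip == 0 and data.strip():
            self.stack[-1][1].append(("#text", data.strip()))

def parse_tree(html: str):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

The parser silently tolerates unclosed and mismatched tags, which mimics (crudely) the tolerance a cleaning step must provide before any tree-based comparison.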
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ordered trees.</head><p>The ordered characteristic of trees implies that the order of appearance of nodes is considered in the algorithm and therefore included in the model.</p><p>The RMS algorithm <ref type="bibr" target="#b37">[38]</ref> stores the level (depth) of a node, information that is later used in tree traversal and parsing. The SCD <ref type="bibr" target="#b23">[23]</ref> algorithm uses branches of ordered trees to detect semantic changes. The notion of branch is used to give the context of a node and is formalized as an ordered multiset, in which the elements are designated by the tag name of the HTML non-leaf node, or its content (text) if it represents a leaf node. The order is very important in this model because of the data hierarchy considered (e.g., book.author.name.Eminescu vs. a markup hierarchy like div.p.b.PCDATA). In this model, if we change the order, we change the semantics of the hierarchies, or this semantics becomes incoherent.</p><p>The ordered tree model is also used in <ref type="bibr" target="#b20">[20]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Unordered trees.</head><p>The unordered labeled tree model does not consider the order of appearance of elements in the tree as relevant; instead, only the parent-child relationships are captured.</p><p>It is mentioned in MH-Diff <ref type="bibr" target="#b9">[9]</ref> that the change detection problem for unordered trees is harder than for ordered ones. Like <ref type="bibr" target="#b17">[17,</ref><ref type="bibr" target="#b9">9]</ref>, <ref type="bibr" target="#b22">[22]</ref> constructs a weighted bipartite graph from the two trees given as input. In these models, the order has no bearing on the final structure on which further processing is done, so this feature is not captured.</p><p>[3] delimits and encodes subtrees; these are generated by analyzing the level of a node in the unordered tree. Starting from the root, a subtree is delimited every three levels, based on the node's level modulo 3; the node one level below becomes the local root of the following subtree, and so on until the end of the hierarchy. Each subtree is marked with the tag name of its local root and indexed by its start and end node identifiers. A hashtable finally maps metadata about nodes (tag name, content, path to the root, attribute-value pairs, etc.).</p></div>
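Our reading of the depth-based delimitation of [3] can be sketched as follows; the (tag, [children]) tuple representation and the exact splitting rule are simplifying assumptions on our part:

```python
# Hypothetical sketch of depth-based subtree delimitation: a node whose
# depth is a multiple of 3 becomes the local root of a new subtree.
# Trees are (tag, [children]) tuples; leaves carry an empty child list.
def delimit_subtrees(node, depth=0, out=None):
    """Return [(local_root_tag, subtree), ...] in depth-first order."""
    if out is None:
        out = []
    tag, children = node
    if depth % 3 == 0:
        out.append((tag, node))  # this node roots a new local subtree
    for child in children:
        if isinstance(child, tuple) and isinstance(child[1], list):
            delimit_subtrees(child, depth + 1, out)
    return out
```

On a five-level chain html > body > div > p > b, this marks html (depth 0) and p (depth 3) as local subtree roots, matching the "every three levels" delimitation.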
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Bipartite graph</head><p>The bipartite graph model is derived from unordered trees: it consists of two independent sets (acquired after tree pruning), each corresponding to a version, that are connected by cost edges. As set elements, subtrees are chosen over nodes in <ref type="bibr" target="#b22">[22,</ref><ref type="bibr" target="#b17">17]</ref>. The assumption <ref type="bibr" target="#b17">[17]</ref> is that we might be more interested in which subtree the change occurs than in which specific node.</p><p>The cost of an edge represents the cost of the edit script needed to make a model entity (node or subtree) of the first set isomorphic to one of the second set. The similarities between all subtrees in the first tree and all those of the second tree are computed and placed in a cost matrix. Given this matrix, the Hungarian <ref type="bibr" target="#b6">[6]</ref> algorithm is used to find in polynomial time a minimum-cost bijection between the two partitions. This algorithm is typically used in linear programming to find the optimal solution to the assignment problem.</p><p>The CMW <ref type="bibr" target="#b17">[17]</ref> and MH-Diff <ref type="bibr" target="#b9">[9]</ref> algorithms are based on transforming the change detection problem into one of computing a minimum-cost edge cover on a bipartite graph. Optimizations over the work in <ref type="bibr" target="#b6">[6]</ref> are introduced in <ref type="bibr" target="#b22">[22]</ref>, but the general aim and similarity metrics remain the same as in <ref type="bibr" target="#b3">[3]</ref>.</p></div>
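The assignment step can be illustrated on a small cost matrix. For clarity, the sketch below brute-forces the minimum-cost bijection over all permutations; the Hungarian algorithm [6] solves the same problem in polynomial time, which is what makes the approach practical on real pages:

```python
from itertools import permutations

# Sketch: given a square cost matrix cost[i][j] between the subtrees of
# two versions, find a minimum-cost one-to-one assignment (brute force;
# real systems use the Hungarian algorithm for polynomial running time).
def min_cost_matching(cost):
    """Return (total_cost, perm) where perm[i] is the column matched
    to row i in a minimum-cost bijection."""
    n = len(cost)
    best_perm = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    best_cost = sum(cost[i][best_perm[i]] for i in range(n))
    return best_cost, best_perm
```

Subtree pairs matched at low cost are candidates for "same object, slightly changed"; high-cost rows or columns correspond to insertions and deletions.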
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Page Digest</head><p>The page digest model<ref type="foot" target="#foot_5">6</ref> has been adopted in <ref type="bibr" target="#b30">[31]</ref> and <ref type="bibr" target="#b37">[38]</ref>, and represents a more compact encoding of data than the HTML and XML formats, while preserving all their advantages. To construct this model, some steps are performed: counting the nodes, enumerating children in a depth-first manner (in order to capture the structure of the document), and encoding and mapping the content for each node, an encoding that preserves the natural order of text in the Web page.</p><p>SDiff <ref type="bibr" target="#b30">[31]</ref> is a Web change monitoring application that uses a digest format that also includes tag type and attribute information. RMS <ref type="bibr" target="#b37">[38]</ref> also uses this model, although without giving many details. The advantages of the Page Digest over the DOM tree model are enumerated in <ref type="bibr" target="#b30">[31]</ref>. We mention minimality and execution performance: the reduction of tag redundancy gives a more compact model and therefore faster document traversal. The algorithms developed on this model run in linear time without resorting to many heuristics or restrictions, while also capturing a large palette of changes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">SIMILARITY METRICS</head><p>Similarity metrics are used in dynamic methods of change detection, in the stage that matches the two versions modeled as described in Section 4.</p><p>For string models, (content) change is identified when the data strings are discovered to be partially dissimilar. For tree models, which have several dimensions, the problem gets more complex.</p><p>The aim is to identify model elements that are essentially the same but have been affected by change. Model elements that are identical are pruned because they have not changed between versions; likewise, totally dissimilar model elements do not represent instances of the object that has evolved, so they are not included in further processing steps. Essentially, only approximately similar model elements are studied further. A typical matching is based on comparing attribute values of the model elements: if they have the same properties, then they are similar.</p><p>We describe next the types of similarity metrics used for change detection in different settings or applications, for the studied cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">String matching techniques</head><p>Approaches that compare two Web pages modeled as flat files (i.e., strings) rely on hash-based methods, edit distance metrics, or techniques based on the longest common subsequence. We do not exhaustively cover all the possibilities, but rather present those we have encountered in the analyzed works.</p><p>Naïve: Jaccard.</p><p>One simple technique for change detection in text is to count the number of words that are common to two string sequences (regardless of their position) and to divide the result by the number of distinct words. Another possibility is to divide by the length of the first string, as done in <ref type="bibr" target="#b29">[30]</ref>.</p></div>
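A minimal sketch of both variants, assuming whitespace tokenization:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity: shared distinct words divided
    by the total number of distinct words, ignoring positions."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def overlap_first(a, b):
    """Variant normalizing by the distinct words of the first
    string, a sketch of the idea used in [30]."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa)

v1 = "breaking news about the web"
v2 = "breaking news about web archiving"
# Four shared words out of six distinct words overall.
```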
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hash-based methods.</head><p>In this category, we mention <ref type="bibr">[8,</ref><ref type="bibr" target="#b20">20,</ref><ref type="bibr" target="#b3">3]</ref>. The method of <ref type="bibr">[8]</ref> uses shingling, a flexible method of detecting changes between two strings. It is usually referred to as 𝑤-shingling, where 𝑤 denotes the number of tokens in each shingle of the set. A shingle is a contiguous subsequence (𝑤-gram) of the reference string. For the two strings to be compared, their shingles are hashed; if the strings have many shingle values in common, then they are similar.</p><p>Another method is to use signatures. A signature of a string is a function of its hashed value. When the model is hierarchical, the signature of a node is computed based on its terminal path. For a formatting (or non-leaf) node, the signature is the sum of the signatures of its children, down to the leaf nodes, where changes are actually detected. In <ref type="bibr" target="#b20">[20]</ref>, only nodes that have different signatures from those in the original version are compared. Change detection algorithms that employ signatures have the disadvantage that false negatives are possible: a change exists, but the signatures coincide. Much depends on a careful choice of the hashing space and, ultimately, on whether the application can tolerate false results.</p><p>Another hash-based approach is presented in <ref type="bibr" target="#b3">[3]</ref>, where change is detected at the atomic element level using a direct hash.</p></div>
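A small sketch of 𝑤-shingling, using Python's built-in hash for illustration (a production system would use a stable fingerprinting function such as Rabin fingerprints):

```python
def shingles(text, w=3):
    """Set of hashed w-grams (shingles) of the token sequence."""
    tokens = text.split()
    return {hash(tuple(tokens[i:i + w]))
            for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=3):
    """Shingle resemblance: Jaccard coefficient over the two shingle
    sets; values near 1 mean the strings barely changed."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

old = "the quick brown fox jumps over the lazy dog"
new = "the quick brown fox leaps over the lazy dog"
sim = resemblance(old, new)  # one changed word breaks 3 shingles
```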
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Longest common subsequence.</head><p>A diff is a file comparison program that outputs the differences between two files. Diff tools are based on computing the longest common subsequence of two strings <ref type="bibr" target="#b31">[32]</ref>. For instance, HTMLDiff uses the GNU diff utility adapted for HTML page change detection. It treats Web pages as strings and, after processing, highlights the changes directly in a merged document; as mentioned in <ref type="bibr" target="#b15">[15]</ref>, HTMLDiff can consume significant memory and computation resources, which may affect the scalability of the tool. Tree models of Web pages also use diff techniques, not at the HTML page level but at the subtree (or node) content level, which has the advantage of making the computation less complex. For instance, WebCQ <ref type="bibr" target="#b24">[24]</ref> uses HTMLDiff to detect change at the object level. Here, the object is in fact a part of a Web page specified for monitoring by the user, either by means of regular expressions or by marking elements of the HTML DOM tree such as a table, list, or link. Another system that uses the GNU diff tool is WebVigiL <ref type="bibr" target="#b18">[18]</ref>.</p><p>Edit scripting on strings.</p><p>The edit distance between two strings of characters is the number of operations required to transform one string into the other. There are different ways of defining an edit distance, depending on which edit operations are allowed: delete, insert, etc. In string edit scripting, the atomic element is a single character and the cost is usually unitary for every edit operation defined.</p></div>
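Both string measures can be sketched briefly. Python's standard difflib builds its output from longest matching blocks, the same principle underlying LCS-based diff tools, and the edit distance below is the classic dynamic program with unit costs (the example strings are invented):

```python
import difflib

def edit_distance(a, b):
    """Unit-cost string edit distance (insert, delete, substitute)
    computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

old = "breaking news today"
new = "breaking news tonight"
# Non-equal opcodes describe the edit regions, diff-style.
ops = [op for op in difflib.SequenceMatcher(None, old, new).get_opcodes()
       if op[0] != "equal"]
```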
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Root Mean Square.</head><p>Another notable way <ref type="bibr" target="#b37">[38]</ref> of computing similarity is to use the RMS (root mean square) value, i.e., the quadratic mean. RMS is a statistical measure of the magnitude of a varying quantity and therefore makes it possible to quantify change. If its value is small, then the difference between the compared elements is not significant. Often used in engineering to estimate the similarity between a canonical model and an empirical one (in order to assess the precision of an experiment), the RMS formula requires numeric values of the model parameters. In the HTML context, the ASCII values of the Web document's text characters are plugged into the canonical formula. Although appealing, the RMS measure applied to the ASCII values of each character has some drawbacks: first, it does not take the hierarchy into account (it considers that every character has equal influence, independently of its position in the page); second, it cannot take into account the semantics of the content. Variants of this measure are presented in <ref type="bibr" target="#b38">[39]</ref>.</p></div>
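A sketch of the RMS idea over ASCII values; comparing the two RMS magnitudes is one simple way to quantify change, under the assumptions just described:

```python
import math

def rms_ascii(text):
    """Root mean square (quadratic mean) of the ASCII values of a
    text's characters."""
    return math.sqrt(sum(ord(c) ** 2 for c in text) / len(text))

def rms_change(old, new):
    """Magnitude of change as the difference of the two RMS values.
    A small value suggests the versions are close, but, as noted
    above, character position and semantics are ignored."""
    return abs(rms_ascii(old) - rms_ascii(new))
```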
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Matching of hierarchical models</head><p>Edit scripting on trees.</p><p>Tree edit scripting has the same formal definition as for strings, except that the basic entities are here tree elements (nodes or subtrees). Edit operations can also occur at different levels, depending on the types of changes considered. As a consequence, cost models of edit operations become more complex. Each edit operation has an associated cost (proportional to the complexity of the operation, or based on heuristics of the model), so the execution of an edit script as a sequence of operations returns a cumulated cost. The similarity measure can then be computed as the inverse of this total cost. The interpretation is the following: the fewer modifications we make to the first data structure in order to make it isomorphic with its version, the more similar the two structures are.</p><p>The problem of determining the distance between two trees is referred to as the tree-to-tree correction problem and is covered in more depth in <ref type="bibr" target="#b5">[5]</ref>. Some works <ref type="bibr" target="#b9">[9]</ref> report that edit scripting with move operations is NP-hard. Usually, every model introduces some kind of model or computational heuristic that makes the problem (a little) less difficult. As an example, <ref type="bibr" target="#b23">[23]</ref> computes an edit script on branches. It can also be done on subtrees, rather than on complete trees, for the same complexity reasons. Concerning the cost of change operations, every technique makes its own decisions.
MH-Diff <ref type="bibr" target="#b9">[9]</ref>, for example, defines costs moderated by constants that depend on the type of change.</p><p>Quantitative measures of change.</p><p>Various works <ref type="bibr" target="#b17">[17,</ref><ref type="bibr" target="#b3">3,</ref><ref type="bibr" target="#b22">22]</ref> use a composed measure of similarity that tries to better adapt to the specifics of the types of changes considered. The change is measured by a formula that incorporates three sub-measures of specific similarity: on the content (intersect: the percentage of words that appear in the textual content of both subtrees), on the attributes (attdist: the relative weight of the attributes that have the same value in the model elements), and on the types of elements considered in the path (typedist emphasizes differences in tag names when going up the hierarchy). The final measure incorporates all the above types of similarity, together with parameters meant to weight the importance of certain types of changes over others. The advantage of this measure is that it captures the combination of different types of changes that occur somewhat independently: content changes in leaf nodes, attribute changes in internal nodes; the third sub-measure focuses on the position of nodes in the structure.</p><p>Another quantitative measure of change is proposed in <ref type="bibr" target="#b23">[23]</ref>. Here, a weighted measure that determines the magnitude of the difference between two ordered multisets (i.e., branches) is employed. In an ordered multiset, the weight of the 𝑖th node is defined as (2 𝑖 ) −1 (where 𝑖 represents the depth of an element of the branch considered). Finally, the quantity of change is measured by computing the sum of the weights of the nodes that appear in the symmetric difference of the two sets.</p></div>
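The branch-based measure of [23] can be sketched as follows; representing a branch as a multiset of (label, depth) pairs is our illustrative assumption, and the example branches are invented:

```python
from collections import Counter

def weighted_branch_change(branch1, branch2):
    """Sketch of the weighted measure of [23]: a node at depth i
    weighs (2**i)**-1, and the quantity of change is the summed
    weight of the nodes in the symmetric difference of the two
    branches, viewed here as multisets of (label, depth) pairs."""
    m1 = Counter((label, depth) for depth, label in enumerate(branch1))
    m2 = Counter((label, depth) for depth, label in enumerate(branch2))
    sym_diff = (m1 - m2) + (m2 - m1)   # multiset symmetric difference
    return sum(count / 2 ** depth
               for (_, depth), count in sym_diff.items())

# Two branches differing only in their deepest node (depth 3):
b1 = ["html", "body", "div", "p"]
b2 = ["html", "body", "div", "table"]
change = weighted_branch_change(b1, b2)  # 1/8 + 1/8
```

Deeper nodes carry exponentially smaller weights, so a change near the root counts more than one in the leaves.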
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">STATISTICALLY ESTIMATING CHANGE</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Motivating estimative models</head><p>We have seen the multitude of domains interested in the change detection issue. The general aim is to get an idea of the dynamics of a certain type of content, at a given granularity.</p><p>In Web crawler-related applications, the interest is mostly in whether a Web page has changed or not, in order to know whether a new version of the page should be downloaded. Only a binary response is needed; this is not because distinguishing between the different types of changes is uninteresting, but because current crawlers treat information at the Web page level. In this case, an estimation of the change frequency is as effective as explicitly computing it, as we show in Section 3. Indeed, if archive crawlers were more aware of the semantics of the data they process, they could clearly benefit from a broader, richer insight into the data and could develop different strategies for storage and processing. An estimation of the change rate, although not very descriptive (we usually do not know where the change appeared or its type), is still useful, especially when we can imagine a strategy that combines estimative and comparative methods of deriving dynamics. For this reason, we briefly present some of the existing statistical approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Poisson model</head><p>The study carried out in <ref type="bibr" target="#b10">[10]</ref> reports that the changes occurring in Web pages can be modeled as a Poisson process. A Poisson process models a sequence of random events that happen independently with a fixed rate over time. Based on a(n ideally complete) change history, the frequency of change is estimated. The time-independence assumption of homogeneous models <ref type="bibr" target="#b10">[10,</ref><ref type="bibr" target="#b11">11]</ref> does not really capture the reality of the Web. The authors of <ref type="bibr" target="#b32">[33]</ref> observe that, for dynamic Web content (such as blogs), posting rates vary considerably depending on various parameters. Hence, they propose an inhomogeneous Poisson model (which does not assume a rate constant over time); this model learns the posting patterns of Web pages and predicts when to recheck for new content.</p><p>The work in <ref type="bibr" target="#b11">[11]</ref> formalizes some use cases where, by adapting the parameters of the Poisson model to the requirements of the application, a better accuracy of the estimation can be achieved. The situation in which we do not have a complete change history of a Web page, which is actually the real-world case, is also treated: sometimes we only have the last date of change, or at best just know that a change has occurred. The authors further adapt the canonical model to interesting applications and feed different estimators, thus improving the technique for the considered cases.</p></div>
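As a sketch, assume a page is sampled at regular intervals and each visit only reveals whether the page changed since the previous visit. A naive estimate divides the number of visits at which a change was seen by the observation time; a bias-reduced estimator in the spirit of [11] accounts for changes missed between visits. The exact formula below is illustrative, not a quotation of [11]; see that work for the derivations.

```python
import math

def naive_rate(changes_detected, accesses, interval):
    """Naive frequency estimate: fraction of accesses at which a
    change was detected, per access interval."""
    return changes_detected / accesses / interval

def estimated_rate(changes_detected, accesses, interval):
    """Bias-reduced Poisson rate estimator (a sketch in the spirit
    of [11]) for change/no-change observations, where several
    changes between two accesses appear as a single one."""
    n, x = accesses, changes_detected
    return -math.log((n - x + 0.5) / (n + 0.5)) / interval

# A page checked daily for 30 days, with changes seen on 9 visits:
lam = estimated_rate(9, 30, interval=1.0)
```

The corrected rate exceeds the naive one because some changes are necessarily missed between consecutive visits.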
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Kalman filters</head><p>Another statistical approach to change detection is that of <ref type="bibr" target="#b7">[7]</ref>. Here, the textual vector space model is employed to identify the patterns of the page and to train Kalman filters with these patterns. In the end, a change is an event that does not match the prediction. The possible disadvantages of this method are that it assumes the linearity of the system (the Kalman filter performs exact inference in a linear dynamical system) and that it uses an incomplete vector space model (for complexity reasons).</p></div>
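A minimal one-dimensional sketch of this idea: track a single scalar feature of a page with a Kalman filter and flag observations that deviate too far from the prediction. All tuning values (q, r, the threshold) and the example series are illustrative assumptions; [7] operates on full term vectors rather than one scalar.

```python
def kalman_change_monitor(observations, q=0.01, r=1.0, threshold=3.0):
    """1-D Kalman filter sketch in the spirit of [7]: track one
    scalar page feature and flag a change when an observation
    deviates too far from the filter's prediction."""
    x, p = observations[0], 1.0    # state estimate and its variance
    flagged = []
    for t, z in enumerate(observations[1:], 1):
        p = p + q                   # predict: state assumed constant
        innovation = z - x
        s = p + r                   # innovation variance
        if innovation ** 2 > threshold ** 2 * s:
            flagged.append(t)       # prediction mismatch => change
        k = p / s                   # Kalman gain
        x = x + k * innovation      # update state estimate
        p = (1 - k) * p
    return flagged

# A stable feature value with one abrupt jump at index 5:
series = [1.0, 1.1, 0.9, 1.0, 1.05, 9.0, 9.1, 8.9]
events = kalman_change_monitor(series)
```

Small fluctuations early in the series stay within the prediction band, while the jump (and the observations that follow, until the filter catches up) are flagged.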
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">OPEN QUESTIONS</head><p>We end this article by discussing several issues that have not been addressed yet and which are related to the deriving of temporal properties of Web pages. The selection of subjects reflects our personal vision of the topic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Further studies on timestamping.</head><p>With the large number of sources of timestamping hints, it should indeed be possible to estimate the freshness of a Web page. An experimental study on the reliability of these sources, perhaps in specific contexts (a given Web server software, a given CMS, etc.), which could provide more insight into optimal strategies for timestamping Web pages, is still to be carried out.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Relevant change detection.</head><p>An important question when detecting changes is whether the changes are relevant to the interests or needs of the user or archivist. Web users are exposed to many advertisements, and content is increasingly prefabricated, put into containers, and delivered in a targeted way, so much more attention should be paid to the relevance factor. Additionally, Web pages have components that are more dynamic than others; it is not possible to say whether the dynamics come from irrelevant content (think of ads changing at every request) or precisely because the content is very informative (e.g., frequently updated news). To make the distinction, techniques that define and extract semantics (e.g., by looking at the topic and linked data) might be used as a filtering method, or simply to add more significance to the change detection results. The cleaning of a Web page, or its segmentation into semantic blocks, is of interest in many information extraction fields and, although we mention only <ref type="bibr" target="#b39">[40,</ref><ref type="bibr" target="#b21">21,</ref><ref type="bibr" target="#b28">29]</ref>, a large number of works treat this subject.</p><p>As an observation, there is a subtle difference in the use of the term "meaningful" across the works we have studied. While some works <ref type="bibr" target="#b23">[23]</ref> use it to emphasize that more types of changes are detected, other approaches <ref type="bibr" target="#b29">[30]</ref> use it as a synonym for "relevant", from the content point of view. Vi-Diff <ref type="bibr" target="#b29">[30]</ref> uses the VIPS algorithm <ref type="bibr" target="#b39">[40]</ref> to derive from a Web page a hierarchy of semantic blocks and detect changes only from this perspective, hopefully ignoring all boilerplate content.
However, a deeper insight into the relevance aspect is mentioned as future work; the authors of <ref type="bibr" target="#b29">[30]</ref> mention using machine learning techniques for this issue, which would be an interesting line of research. Usually, heuristics <ref type="bibr" target="#b39">[40]</ref> are employed to obtain a measure of relevance, because a generic source of knowledge that could be used automatically for this task is very difficult to obtain.</p><p>Recently, <ref type="bibr" target="#b28">[29]</ref> used the concepts that can be extracted from the description of Web feeds to get the content of interest from Web pages. However, this kind of semantics can only be obtained for Web pages that have associated feeds.</p><p>With the emergence of the Semantic Web, we envision new ways of distinguishing timestamps or of filtering irrelevant content from Web pages, and therefore more efficient methods for deriving the dynamics of Web pages.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Summary of the presented approaches</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Examples of these include http://timelyweb.en. softonic.com/, http://www.urlywarning.net/, http:// www.changealarm.com/, http://www.changedetect.com/.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.mediawiki.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://dublincore.org/documents/dcmi-terms/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://tidy.sourceforge.net/ TWAW 2011, Hyderabad, India</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://www.cc.gatech.edu/projects/disl/ PageDigest/</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>* This research was funded by the European Research Council grant Webdam FP7-ICT-226513.</p></div>
			</div>

			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Issues in monitoring web data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Abiteboul</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. DEXA</title>
				<meeting>DEXA</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The Web changes everything: Understanding the dynamics of Web content</title>
		<author>
			<persName><forename type="first">E</forename><surname>Adar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Teevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Elsas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WSDM</title>
				<meeting>WSDM</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A fast HTML Web change detection approach based on hashing and reducing the number of similarity computations</title>
		<author>
			<persName><forename type="first">H</forename><surname>Artail</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fawaz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Knowl. Eng</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Web dynamics, structure, and page quality</title>
		<author>
			<persName><forename type="first">R</forename><surname>Baeza-Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Castillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Saint-Jean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Web Dynamics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Levene</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Poulovassilis</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Tree-to-tree correction for document trees</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Barnard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Duncan</surname></persName>
		</author>
		<idno>95-372</idno>
		<imprint>
			<date type="published" when="1995">1995</date>
			<pubPlace>Kingston, Ontario, Canada</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Queen&apos;s University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Parallel asynchronous Hungarian methods for the assignment problem</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Bertsekas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Castañon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">INFORMS J. Computing</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Application of Kalman filters to identify unexpected change in blogs</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">L</forename><surname>Bogen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Karadkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Furuta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Iii</forename><surname>Shipman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. JCDL</title>
				<meeting>JCDL</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">On the resemblance and containment of documents</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Z</forename><surname>Broder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SEQUENCES</title>
				<meeting>SEQUENCES</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Meaningful change detection in structured data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chawathe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
				<meeting>SIGMOD</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The evolution of the Web and implications for an incremental crawler</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. VLDB</title>
				<meeting>VLDB</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Estimating frequency of change</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM TOIT</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Concerning Etags and datestamps</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Clausen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IWAW</title>
				<meeting>IWAW</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A comparative study of XML change detection algorithms</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cobéna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Abdessalem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Service and Business Computing Solutions with XML</title>
				<imprint>
			<publisher>IGI Global</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Detecting changes in XML documents</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cobéna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Abiteboul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Marian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
				<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The AT&amp;T Internet difference engine: Tracking and viewing changes on the Web</title>
		<author>
			<persName><forename type="first">F</forename><surname>Douglis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ball</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-F</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Koutsofios</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">World Wide Web</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A large-scale study of the evolution of Web pages</title>
		<author>
			<persName><forename type="first">D</forename><surname>Fetterly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Manasse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wiener</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WWW</title>
				<meeting>WWW</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Efficient and effective Web page change detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Flesca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Masciari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Knowl. Eng</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">WebVigiL: An approach to just-in-time information propagation in large network-centric environments</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jacob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sanka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Pandrangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chakravarthy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Web Dynamics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Levene</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Poulovassilis</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Detecting age of page content</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kawai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Tanaka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WIDM</title>
				<meeting>WIDM</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A novel approach for web page change detection system</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Khandagale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P</forename><surname>Halkarnikar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Intl. J. Comput. Theory Eng</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Boilerplate detection using shallow text features</title>
		<author>
			<persName><forename type="first">C</forename><surname>Kohlschütter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fankhauser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nejdl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WSDM</title>
		<meeting>WSDM</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">An efficient Web page change detection system based on an optimized Hungarian algorithm</title>
		<author>
			<persName><forename type="first">I</forename><surname>Khoury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>El-Mawas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>El-Rawas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">F</forename><surname>Mounayar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Artail</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE TKDE</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">An automated change-detection algorithm for HTML documents based on semantic hierarchies</title>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-K</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
		<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">WebCQ: Detecting and delivering information changes on the Web</title>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. CIKM</title>
		<meeting>CIKM</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Web Archiving</title>
		<author>
			<persName><forename type="first">J</forename><surname>Masanès</surname></persName>
		</author>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">What&apos;s new on the Web? The evolution of the Web from a search engine perspective</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ntoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olston</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WWW</title>
		<meeting>WWW</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Using neighbors to date Web documents</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nunes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>David</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WIDM</title>
		<meeting>WIDM</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Archiving data objects using web feeds</title>
		<author>
			<persName><forename type="first">M</forename><surname>Oita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Senellart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IWAW</title>
		<meeting>IWAW</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">A novel Web archiving approach based on visual pages analysis</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pehlivan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ben Saad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gançarski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IWAW</title>
		<meeting>IWAW</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Page digest for large-scale Web services</title>
		<author>
			<persName><forename type="first">D</forename><surname>Rocco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Buttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. CEC</title>
		<meeting>CEC</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Hardness of string similarity search and other indexing problems</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Sahinalp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Utis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICALP</title>
		<meeting>ICALP</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Efficient monitoring algorithm for fast news alerts</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Sia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-K</forename><surname>Cho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE TKDE</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">7</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Incremental crawling with Heritrix</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sigurðsson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IWAW</title>
		<meeting>IWAW</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<ptr target="http://www.sitemaps.org/protocol.php" />
		<title level="m">Sitemaps XML format</title>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<ptr target="http://www.w3.org/TR/REC-html40/" />
		<title level="m">HTML 4.01 Specification</title>
		<imprint>
			<publisher>W3C</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">X-Diff: An effective change detection algorithm for XML documents</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>DeWitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Cai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
		<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Change detection in Web pages</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Gupta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICIT</title>
		<meeting>ICIT</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Parallel crawler architecture and web page change detection</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Gupta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WSEAS Transactions on Computers</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">7</biblScope>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Improving pseudo-relevance feedback in Web information retrieval using Web page segmentation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-Y</forename><surname>Ma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WWW</title>
		<meeting>WWW</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
