A Dataflow Approach To Efficient Change Detection of HTML/XML Documents in WebVigiL∗ Anoop Sanka, Shravan Chamakura, Sharma Chakravarthy Department of Computer Science & Engineering The University of Texas at Arlington {asanka,chamakur,sharma}@cse.uta.edu Abstract VigiL to highlight role of change detection graph (CDG) which forms the core of the WebVigiL The burgeoning data on the Web makes it dif- system. ficult for one to keep track of the changes that constantly occur to specific information of inter- est. Currently, the most widespread way of de- 1 Introduction tecting changes occurring to Web content is to manually retrieve the pages of interest and check The world wide web has outdone traditional me- them for changes. This mode of action not only dia such as television, to become an indispens- wastes useful resources, but also presents infor- able source of information. There is data for mation that may not be relevant to the given everyone and everything on the Web, and this context. data is increasing at a rapid pace. This plethora In this paper, we present a change-monitoring of information often leads to situations wherein system – WebVigiL – which efficiently monitors users looking for specific pieces of information user-specified web pages for customized changes are flooded with irrelevant data. For example, an and notifies the user in a timely manner. We information technology professional who is only present the dataflow approach used for detect- interested in news related to his areas of inter- ing multiple types of changes to a page. This est, is flooded with news on many topics when approach has been optimized to group simi- he goes to any popular news website. There are lar/same specifications to reduce the computa- other situations in which users are interested in tion of changes. Multiple changes to the same knowing about the updates happening to a par- page can also be handled in our approach. We ticular web page of their interest. For example, also provide the overall architecture of Web- students might want to know about the updates ∗ that are made to their course websites regard- This work was supported, in part, by the Office of Naval Research & the SPAWAR System Center-San Diego ing any new projects. In other scenarios there & by the Rome Laboratory (grant F30602-01-2-05430), are large software development projects where and by NSF (grant IIS-0123730). there are number of documents, such as require- 1 76 ments analysis, design specification, detailed de- or XML pages. Changes to pages and changes sign document and implementation documents. to images, links, keywords and etc. correspond Typically, a large number of people are work- to primitive events when mapped to the ECA ing on the project and managers need to be paradigm and their combinations form compos- aware of the changes to any one of the docu- ite events. Thus, some of the techniques devel- ments to make sure the changes are propagated oped for active databases, when extended appro- properly to other relevant documents. In gen- priately, will provide a solution to detect changes eral, the ability to specify changes to arbitrary to web pages. This paper focuses on developing documents and get notified in different ways will a framework and to provide a selective propa- be useful for reducing the wasteful navigation. gation approach to detect changes that are of The proposed architecture also provides a pow- interest to the users in the context of web and erful way to disseminate information efficiently other large-scale network-centric environments without sending unnecessary or irrelevant infor- by adapting and extending the existing active mation. technology. Today, information retrieval is mostly done us- The remainder of the paper is organized as fol- ing the pull paradigm where the information of lows. In section 2 we present the related work interest is pulled from its source by the user giv- that has been done in the field of change detec- ing a query (or queries). The user takes the tion of web content. In section 3 we present the burden of analyzing the pulled information for architecture of the WebVigiL system. In section any changes of interest. Although there are ap- 4 we show how the ECA paradigm is incorpo- proaches, such as mailing lists, to notify users of rated into the system. In section 5 we present the the changes that happen to information of inter- core component of the WebVigiL system which est, there is little or no scope for the user to cus- is the Change Detection Graph (CDG). Finally, tomize those notifications. The user has to be we discuss future work in section 6. satisfied with whatever information the source wishes to send rather than what specific infor- mation the user wants. 2 Related Work A great deal of research has been done in the field of active technology to provide the capa- Many tools have been developed and are cur- bility of timely responses for many applications. rently available for tracking changes to web Event-Condition-Action (or ECA) rules are used pages. AIDE (AT&T Internet Difference En- to provide active capability to a system. In gine) [4], developed by AT&T shows the differ- the case of large-scale distributed environments ence between two HTML pages. The granular- such as the Web, users are interested in mon- ity of change detection is restricted to a page in itoring changes to a particular web page. But AIDE. It is not possible to view changes at a there are instances in which the change detec- finer level of granularity, such as links within a tion is required at a finer granularity, such as page, keywords, images, tables, lists or phrases. specifying changes to links, images, phrases or NetMind [9] formerly known as URL-minder keywords in a page. Web pages that are moni- provides keyword or text-based change detection tored for detecting changes may be either HTML and notification service over web pages. Net- 2 77 Mind detects changes to links, images, keywords compared with the above systems are: and phrases in an HTML page. The media of notification are e-mail or mobile phone. The sys- 1. Properties of monitoring requests can be in- tem lacks the support to specify ignoring changes herited: The user has the option of specify- which the user is not interested in. Also, it lacks ing the monitoring request to be dependent the support to specify composite changes (when on the status of other monitoring requests. both links AND images change) on a page. Also One can specify the start/end of a request there is no provision for the user to come back to be the start/end of another request. later and view the last changes that have been 2. Flexible specification of versions: All the detected. The frequency of when to poll the page above systems compute changes between is predefined. two successive pages. In WebVigiL, the user WebCQ [8] is a prototype system for large- can explicitly specify the pages that can par- scale web information monitoring and deliv- ticipate in change detection. ery, which makes use of the structure present in hypertext and the concept of continuous 3. Composite change detection: WebVigiL pro- queries. WebCQ is designed to discover and de- vides an elegant way to specify multiple tect changes to the web pages and to provide a change types, such as ’changes occurring to personalized notification of the changes to the either images or links’, which none of the users. User’s monitoring requests are modelled above systems provide. as continuous queries on the web. The authors specify that composite changes can be detected but , currently the system does not seem to 3 WebVigiL Architecture support them. WebCQ lacks a fine grouping WebVigiL is a change-detection and notification strategy which results in the change being com- system, which can monitor and detect changes to puted more than once for two users having the unstructured documents in general. WebVigiL same type of request. The system only supports aims at investigating the specification, manage- HTML documents and the frequency of fetching ment and propagation of changes as requested by is at the level of day which does not prove good the user in a timely manner while meeting the for situations which require a short frequency of quality of service requirements. Figure 1 summa- fetching. Also there is no provision to specify rizes the complete architecture of WebVigiL. The new requests based on old requests. functionality of each module is described briefly WYSIGOT [11]is a commercial application in the following sections. that can be used to detect changes to HTML pages. This system has to be installed on the 3.1 Sentinel local machine, which is not always possible. The system has the feature to monitor an HTML WebVigiL provides an expressive language with page and also all the pages that it points to. well-defined semantics for specifying the moni- But the granularity of change detection is at the toring requirements pertaining to the Web. Each page level. monitoring request is termed a Sentinel. A Sen- Some of the salient features of WebVigiL when tinel encompasses the following: the target URL, 3 78 the change type desired (which can be links, details of the sentinels set by them. The details images, phrases, keywords or a combination of of the sentinels have to be stored on a persistent these using the OR, AND, NOT operators), medium for the purposes of various modules and specifying a fetch frequency if known or leaving it also to provide recovery to a stable state in case to the system to adapt to the changes, what ver- of system failure. All the modules in the sys- sions of fetched pages to compare changes (pair- tem interact with the Knowledge base for their wise, every n or moving n) and the change notifi- proper functioning. cation mode (e-mail, PDA, fax). A Sentinel can also inherit the properties of previously defined sentinels and depend on their life-cycles. 3.4 Change Detection Module For example, the Sentinel created for a request to monitor http://www.cnn.com for any updates Every valid user request arriving at WebVigiL related to Iraq, for a period of 2 years starting initiates a series of operations that occur at dif- from now, would be mapped as follows: ferent points in time. Some of these opera- Create Sentinel s1 on the URL tions are: creation of a sentinel (based on start http://www.cnn.com time), monitoring the requested page, detecting Monitor for keywords “Iraq”, Fetch using changes of interest, notifying the user(s) of the ‘Best effort’ change, and deactivation of sentinels. This mod- From NOW to NOW + 2 Years ule generates ECA [1] rules to perform the ac- Notify by e-mail to user@uta.edu every 4th day tions of: activating and deactivating sentinels, Compare alternate versions of fetched pages. constructing and maintaining Change Detection A detailed explanation is given in [7]. Graph and generating fetch rules. Change detec- tion algorithms that give the difference between two HTML/XML pages have been developed [7] 3.2 Verification Module [10]. This module processes user requests for syntac- tic and semantic correctness. Valid sentinels are populated in the Knowledge base (currently, Or- 3.5 Fetch Module acle) and a notification of the valid sentinels is The fetch module [3] is responsible for retrieving sent to the change detection module. Briefly, the main functions of this module are: load balanc-the pages registered with it and thus serves as a ing and syntactic validation between client andlocal wrapper for the task of fetching pages de- pending upon the user-set fetching policy. This server and semantic validation of sentinels at the server, as the dependency information specifiedmodule informs the version controller of every in the sentinel is available at the server. version it fetches, stores it in the page reposi- tory and notifies the CDG of a successful fetch. The wrapper fetches a page only when its prop- 3.3 Knowledge Base erties indicate that a change has occurred. These Knowledge base is a persistent repository con- properties are: the last-modified time for static taining meta-data about each user, count and web pages and checksum for dynamic web pages. 4 79 Figure 1: WebVigiL Architecture 3.6 Version Management manner. The different ways in which changes can be displayed are: merging two documents An important feature of WebVigiL architec- with the changes highlighted, displaying only the ture is its server-based repository service, which changes in a single page and highlighting the dif- archives and manages different versions of pages. ferences in both the pages side-by-side. Notifi- The primary purpose of the repository service cation of changes can be done according to the is to reduce the network traffic by reducing the user-specified frequency or whenever a change is number of network connections made to the re- detected. mote server. When a request is made for a page, the version management checks for the page in its cache and returns it if it is present. Other- 4 Activation & Deactivation of wise, the page is fetched and stored for future ECA rules requests. WebVigiL uses the Java Local Event Detec- 3.7 Presentation Module tor(LED) to incorporate active capability in the system. LED is a library designed to provide The primary functionality of this module is to support for primitive and composite events, and clearly present the detected changes in a legible rules in Java applications in a seamless manner. 5 80 triggered at the specified time point, rule T1 is executed, which in turn raises the event Start S1. Triggering of the event Start S1 activates the sentinel S1 and also initiates the periodic event used for fetching the pages of URL specified in S1. Now, if another sentinel S2 which is defined over the interval [start(S1), end(S1)] arrives, the Figure 2: Event Generation events and rules are generated in order to enable S2 are shown in Figure 2. Here, we are associat- Primitive and composite event detection in vari- ing the rule R start S2 with the event Start s1, ous parameter contexts and coupling modes has which was created at the arrival of sentinel S1. been implemented. This rule actually raises the Start S2 event to During its lifespan, a sentinel is active and par- activate the periodic event associated with S2. ticipates in change detection. A sentinel can be In this manner, ECA rules are used to asyn- disabled (does not detect changes during that pe- chronously activate and deactivate sentinels at riod) or enabled (detects changes). By default, run time. Once the appropriate events and rules a sentinel is enabled during its lifespan. The are created, the local event detector handles the user can also explicitly change the state of the execution at run time. By enabling/disabling of sentinel during its lifespan. The start/end of a sentinel, we mean addition/deletion of that sen- sentinel can be time points or events. When a tinel to the change detection graph. sentinel’s start time is now, it is enabled imme- diately. But in cases where the start is at a later time point or depends on another event that has 5 Change Detection Graph not occurred, we need to enable the sentinel only when the start time is reached or the event of The assumption while developing the WebVigiL interest has occurred. In WebVigiL, the ECA system has been that even though there will rule generation module creates the appropriate be requests for different change types on differ- events and rules to enable/disable sentinels. We ent URLs, there will be overlaps among URLs, achieve this as follows. Consider the scenario types of changes, frequency of access, etc. One where S1 is defined in the interval [06/02/04, of the goals of WebVigiL is to process sentinels 01/02/05]. At time 06/02/04 sentinel S1 has to efficiently and be able to scale to a very large be enabled. Figure 2 shows the events and rules number of sentinels. A very naı̈ve approach for that are generated to enable sentinel S1. change detection is to maintain a hash table with Fetch S1 is a periodic event created with the pages of interest as the keys and the values Start S1 as the start event, the frequency of page being the list of sentinels monitoring that page. fetch, and End S1 as the end event. The rule as- When a page is fetched, the sentinel(s) on that sociated with it handles the fetching of page for page are extracted and change type is detected S1. A rule associated with an event is fired when for each sentinel. This approach is not efficient the event is triggered. More than one rule can be as it results in redundant change calculations if associated with an event. When event Temp1 is more than one sentinel is interested on the same 6 81 Primitive change detection involves detecting changes to links, images, keywords etc., in a page. In order to facilitate primitive change detection, grouping of sentinels, and data flow we construct a graph. This graph is referred to as the change detection graph (CDG). The graph is constructed bottom up as shown in Figure 5.3. The different types of nodes in the graph are as follows: page for the same change-type. To remove the redundant calculations, a different approach can be employed wherein the sentinels in the hash- table are grouped based on the change-type. This means that the values present in the hash- table will contain a list of sentinel lists which are grouped on the same change-type. This re- sults in some efficiency, but it becomes difficult to implement composite change types (like im- Figure 5.3: Change Detection Graph ages AND links) using this approach. Figure 3: Change Detection Graph We need a data structure that will allow • us URL node (Un ): A URL node is a leaf node at level-0 (L0 ) that denotes the interest (e.g., “www.uta.edu”). The number to asynchronously feed fetched pages for change of URL page of interest (e.g.,nodes in the graph “www.uta.edu”). is equal The number to the of URL nodes in the detection, allow parallelism where possible, opti- number of distinct pages the system is mon- mize the computation by grouping sentinels over graph is equal to theatnumber itoring of distinct pages that particular the system instant is monitoring of time. A at URL’s and change types, and facilitate compos- page is fetched at this level and propagated ite change detection using the same paradigm as that particular instant of time. At this level whenever the version of a page to respective nodes at level-1. primitive change detection. Deletion and prop- is fetched (treated as fetch event), it is propagated to respective nodes at agation of delete semantics must be straightfor- • Change type node (Cn ): All level-1 (L1 ) ward in the representation chosen. Although level-1. a nodes in the graph are change type nodes. number of data structures have been proposed These nodes represent the the type of change in the literature for event detection, such as on a page (links, images, keywords, phrases Petri nets [5], extended automata [6], it has been etc.,). Change detection of pages is per- shown that event graphs support the require- formed at this 37 level. The maximum number ments at the granularity and grouping that is of change type nodes that are created in the appropriate for our problem. Hence, we have system is equal to the product of the number adapted and extended the event graph approach of change types supported and the number proposed for snoop [3] for detecting primitive as of URL nodes present at that instant. well as composite change. In the graph, to facilitate the propagation Primitive change detection involves detecting of changes, the relationship between nodes at changes to links, images, keywords, etc., in a different levels is captured using the subscrip- page. In order to facilitate primitive change de- tion/notification mechanism. The higher-level tection, grouping of sentinels, and data flow we nodes subscribe to the lower level nodes in the construct a graph. This graph is referred to as graph. This subscription information is main- the Change Detection Graph (CDG). The tained in the subscriber list at each node. This graph is constructed bottom-up as shown in Fig- subscriber list at each node contains the follow- ure 3. ing: The different types of nodes in the graph are as follow: Level-0: Contains references of level-1 • URL node (Un ): A URL node is a leaf nodes. node at level-0 (L0 ) that denotes the page of Level-1: Contains references of sentinels 7 82 own in Figure 5.4. The node references are maintained at the URL node (page i). Table 5-2. Compare-options supported Compare- Descriptions Point of detection the new version of the pagei is fetched, it is propagated to the links and options images Pair-wise Detects changes to consecutive Once the subsequent version versions of the same page arrives Moving:n Detects changes to versions “i” and (i- Once the nth version arrives At each change type node, the previous version is retrieved from the version n+1) (where “i” is the version of the page fetched and “n” is the moving window size) of the same page oller and the appropriate (links, images) change is computed. If there isEvery:n change, the Detects changes to every “n” versions Every change is detected after of the same page waiting for “n” versions. These versions are termed as waiting els subscribed to it are notified, in this case it is notified to S1 and S3. versions. Figure 5.5: Compare-Options with Every: 4. Figure Consider the5: following Compare-Options example where S1 and withS2 areEvery:4 on-change sentinels Figure 5.4: Primitive Change Example and projects monitoring related the same change to page on the same that course. (P) with This ofweb- the compare-option “every:4” page is updated as the course progresses and all as shown in Figure 5.5. The points on the time line denote either the arrival of sentinels Figure 4: Primitive Change Example the or the students versions of pagewould be interested P. For S1, change is detected betweeninthethe updates versions (p1 , p4 ) and so monitoring When a sentinel reaches its endthat change. time or is explicitly disabled by happening the user, the to this page. Similarly, many users on. When S2 arrives, since there is already a cached version p2 the change is detected might be interested on updates happening to a between (p2 , p5 ) and so on. For S3 the change is computed with (p4 , p7 ). Hence S1 and S2 The arrows el no longer participates in the detection. in change graph represent the data flow. This information is propagated to the on different topics. One might be news website For example, consider two sentinels, S 1 monitor- cannot be grouped together whereas S1 and S3 can be grouped. This behavior is interested in sports while another might be in- e type node withing changes which to linksis and the sentinel S3 monitoring associated. Since thechanges change type node applicable terested onlyis in apolitics. The sentinels on these key- to sentinels with every as the compare-option since only selected versions to images on URLi as shown in Figure 4. When words can be grouped and the page can be mon- riber to the URL anode,new itversion decidesofonthe whether to remove page with URLitsi issubscription fetched, based on the itored for a union of the keywords {sports, pol- 42 it is propagated to the links and images node. itics, ...}. But sentinel grouping is not possible sentinels that areAt associated with it.type each change Oncenode, this information is propagated to the URL change is computed for fixed-interval sentinels as each sentinel has a using a previous version of the page. If there is it removes the references different version. In spite of a sentinel belong- a change,ofsentinels the corresponding S1 and S2change type node and does are notified. not send ing to on-change, the other attribute that plays When a sentinel is disabled, the information to ext version of the page for change computation. For example, if sentinela Srole 1 shownin the in grouping strategy is the compare- stop it from performing change computation is options (pair-wise, moving:n, every:n). For ex- propagated to the change type node with which ample, requests for pairwise comparison cannot e 5.4 is disabled, the next version of the page is not propagated to the links node it is associated. The change type node decides be grouped with requests for every:3 (i.e, the URL node (page on),whether to propagate since there are no otherthis sentinel information to the that are interested current in links version compared with the third ver- i URL node based on other sentinels on that URL. sion from now). Consider the following example Once the URL node gets the information, it re- where S and S are on-change sentinels mon- 1 2 moves the references of the corresponding change itoring the same change on the same page (P) type nodes and does not send any more new ver- with the compare-option of “every:4” as shown sions to it. For example, if S1 shown in Figure 3 in Figure 5. The points on the time line denote is disabled, the next 39version of the page fetched is the arrival of sentinels and version os page P. not propagated to the links node from the URL For S , change is detected between the versions 1 node. (p1 , p4 ). When S2 arrives, since there is already To attain system scalability and better per- a cached version p , the change is detected be- 2 formance, sentinels are grouped when there is tween (p , p ) and so on. For S , the change is 2 5 3 more than one sentinel interested on the same computed with (p , p ). Hence S and S cannot 4 7 1 2 change type on the same URL. For example, con- be grouped together whereas S and S can be 1 3 sider a course webpage listing the assignments 8 83 change), compare-options. When the compare-option is “every:n” then the corresponding time at which the sentinel arrives is taken into consideration. Figure 5.6: Grouping Data Structure Figure 6: Grouping Data Structure grouped. Thus,Figure 5.6 shows the sentinels the information are grouped used for grouping sentinels based on the on a combina- tion of change type, fetch type (on-change) and strategy explained above. compare-options. When Thethe“subscriber list ptr” contains compare-option is all sentinels that belong to the “every:n”, then the corresponding time at which same group. The “word set” attribu te is null for links, images and any-change or union the sentinel arrives is taken into consideration. Figure 5.9. 7: Figure Composite CompositeChange Detection Change Graph Graph Detection Figure 6 shows the grouping data structure. AND node. At the AND node, S5 is informed So far primitive change detection has been dis- only when it receives notifications from both its cussed. A composite event is an event expression As shown in Figure 5.9 composite event nodes are in the levels L2 and constituent Sand sentinels. comprising a set of events connected through 43 above. Composite Node represents a combination of change types through the operators one or more composite event operators which are NOT, AND, OR. As shown in Figure 7, compos- 6 Conclusion and Future Work NOT, AND, and OR. They can extend to any number of levels. These nodes are created ite event nodes are at levels L2 and above. The change is computed forfor allevery the sentinels The first sentinel monitoring version of a composite theLevel-2 event. WebVigiL system and above contains references present in the subscriber list at the change type has been implemented and can be accessed of the nodes node. Hence, for sentinels monitoring belonging compos- fromto the immediate higher level (composite event http://berlin.uta.edu:8081/webvigil. All containing more ite changes, a representation at its constituent the modules shown in Figure 1 has been imple- change type node is needed. This than is two constituents) or implemented sentinels. mented. The system is being extended in a num- by creating proxy sentinels with the same prop- ber of ways. The current architecture of We- erties of the original sentinel at each of Thethechange con- isbVigiL computed for all the facilitates the sentinels monitoringpresent in thepages of only subscriber list at the stituent change type node. Consider the scenario without embedded frames. To monitor a page change type node. Hence, for sentinels monitoring composite changes, a representation at where sentinel S5 is interested in links and im- with frames, the exact URL of the frame has ages change on pagei (Figure 8). When a new to be given. Work is in progress to extend the its constituent change type node is needed. This is implemented by creating proxy version of the pagei is fetched, it is propagated to system to internally handle pages with frames. the links and images nodes. If there is awith sentinels change, This properties the same will be achieved by extending of the original sentinel the system at each of the constituent sentinels subscribed to it are notified. Sentinel to take sentinels which monitor more than one Sand acts as a proxy for S5 . When changeSand typeis node. noti- Consider URL. Work the is scenario also inwhere sentinel progress S5 is interested to provide an in- in links and fied, the change computed is propagated to the dependent fetch module that can be installed on 9 46 84 images change to “page i” (refer Figure 5.10). When the new version of the “page i” is fetched it is propagated to the links and images node. If there is a change, sentinels subscribed to it are notified. Sentinel Sand acts as a proxy for S5 . When Sand is notified the change computed is in turn propagated to the AND node. At the AND node, S5 is informed only when it receives notifications from both its constituent Sand sentinels. on the web. Technical report, AT&T Labs, 1998. [5] S. Gatziu and K. Dittrich. Samos: an ac- tive, object-oriented database system. IEEE Quarterly Bulletin on Data Engineering, 1992. [6] N. Gehani and H. Jagadish. Active database facilities in ode. IEEE Bulletin of the Technical Committee on Data Engineering, 1992. Figure 5.10: Composite Example [7] J. Jacob. Webvigil: Sentinel specificatin Figure 8: Composite Example and user-intent based change detection for thethesystem Following are hosting steps taken a website. when a new This module sentinel is registered will with the system: xml. Master’s thesis, The University of locally detect changes and push that information Texas at Arilngton, 2003. 1). The URL node corresponding to the target page of sentinel is created if there is to the remote WebVigiL server thereby reducing [8] L. Ling, P. Calton, and T. Wei. We- none. the network traffic. bcq: Detecting and delivering information 2). The change type node associated with the target change is obtained; if therechanges is on the web. In Proceedings of In- References ternational Conference on Information and none, a new node is created. Knowledge Management (CIKM), Washing- [1] E. Anwar, L. Maugis, and S. Chakravarthy. ton D.C, 2000. 3). The grouping structure is traversed to obtain the group to which the new sentinel A new perspective on rule support for belongs. If there is no such group object-oriented a new groupInisProceedings, databases. [9] Mind-it. created and the sentinel is http://zen- International Conference on Management eco.com/divingparadise/netminder.htm. of Data, pages 99–108, Washington, D.C., [10] N. Pandrangi et al. Webvigil: User-profile May 1993. based change detection for html/xml docu- [2] S. Chakravarthy et al. Hipac: A research ments. In Proceedings 20th British National project in active,47time-constrained database Conference on Data Bases, Coventry, UK, management, final report. Technical Report 2003. XAIT-89-02, Xerox Advanced Information Technology, Cambridge, MA, Aug 1989. [11] WYSIGOT. http://www.wysigot.com/. [3] S. Chakravarthy and D. Mishra. Snoop: An expressive event specification language for active databases. Data and Knowledge En- gineering, 14(10):1–26, October 1994. [4] F. Douglis et al. The at&t internet differ- ence engine: Tracking and viewing changes 10 85