<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WebVigil: An Approach to Just-In-Time Information Propagation in Large Network-Centric Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sharma Chakravarthy</string-name>
          <email>sharma@cse.uta.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jyoti Jacob</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naveen Pandrangi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anoop Sanka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Engineering Department The University of Texas at Arlington</institution>
          ,
          <addr-line>Arlington, TX 76019</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Efficient and effective change detection and notification are becoming increasingly important for environments such as the WWW and distributed heterogeneous systems. Change detection for structured data has been studied extensively. Change detection and notification for unstructured data in the form of HTML and XML documents is the goal of this work. The objectives of this work are to investigate the specification, management, and propagation of changes as requested by a user in a timely manner while meeting quality of service requirements.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Active rules have been proposed as a
paradigm to satisfy the needs of many database
and other applications that require a timely
response to situations. Event–Condition–Action
(or ECA) rules are used to capture the active
capability in a system. The utility and
functionality of active capability (ECA rules)
has been well established in the context of
databases. In order for the active capability to be
useful for a large class of advanced applications,
it is necessary to go beyond what has been
proposed/developed in the context of databases.</p>
      <p>Specifically, extensions beyond the current state
of the art in active capability are needed along
several dimensions:</p>
      <list list-type="order">
        <list-item><p>Make the active capability available for
non-database applications, in addition to
database applications;</p></list-item>
        <list-item><p>Make the active capability available in
distributed environments;</p></list-item>
        <list-item><p>Make the active capability available for
heterogeneous sources of events (whether
they are databases or not).</p></list-item>
      </list>
      <p>In this paper, we address 2) and 3) based on our
preliminary architecture.</p>
      <p>There are a number of situations where one
needs to know when changes are made to one or
more documents that are stored in a distributed
(typically heterogeneous) environment. The
number of documents that need to be monitored
for changes is large, and the documents are spread over
multiple information repositories. The emphasis
here is on selective notification; that is, appropriate
persons/groups are notified of changes based
upon an interest (or profile/policy) that has been
established earlier. Also, there should be a
mechanism for establishing the
interests/profiles/policies. Currently, change
detection is done either manually or by using
queries to check whether any document of
interest has changed (since the last check). This
wastes resources and, at the same time,
does not meet the intended timeliness (where
important) of change detection and the associated
notification. Also, quality of service issues
cannot be accommodated in this approach.<fn id="fn1"><label>1</label><p>This work was supported, in part, by the Office of Naval Research &amp; the SPAWAR System Center–San Diego &amp;
by the Rome Laboratory (grant F30602-01-2-05430), and by NSF (grant IIS-0123730).</p></fn></p>
      <p>As an example, the above situation is very
common in a large software development project
where there are a number of documents, such as
requirements analysis, design specification,
detailed design document, and implementation
documents. The life cycles of such projects span
years (and some span decades), and changes to
various documents of the project take place
throughout the life cycle. Typically, a large
number of people are working on the project and
managers need to be aware of the changes to any
one of the documents to make sure the changes
are propagated properly to other relevant
documents and appropriate actions are taken.</p>
      <p>Large software development projects are carried out in
distributed environments. Information retrieval
in the context of the web is another example that
has similar characteristics. Different users may
be interested in changes to specific web
pages (or even combinations thereof), and want
to know when those changes take place. The
approach proposed in this paper avoids
periodic polling of the web to see whether the
information has changed or not. Some examples
are: students want to know when the web
contents of the courses they have registered for
change; users may want to know when news
items are posted with some specific context they
are interested in. In general, the ability to specify
changes to arbitrary documents and get notified
in different ways will be useful for reducing the
wasteful navigation of the web in this information
age. The proposed approach also provides a
powerful way to disseminate information
efficiently without sending unnecessary or
irrelevant information. It also frees the user from
having to constantly monitor for changes using
the pull paradigm.</p>
      <p>Today, information retrieval is mostly done
using the pull technology where the user is
responsible for posing the appropriate query (or
queries) to retrieve the needed information. The
burden of tracking changes to the contents of pages
in web sites of interest is on the user, rather than
on the system. Although there are a number of
systems that send information to interested users
selectively (periodically by airlines, for
example), the approach commonly used is to use
a mailing list to send compiled information.</p>
      <p>Other tools that provide real-time updates in the
web context (e.g., stock updates) are custom
systems that still use the pull technology
underneath to refresh the screen periodically.</p>
      <p>We believe that some of the techniques
developed for active databases, when extended
appropriately and combined with new research,
will provide a solution to the above
class of problems. In addition, there is a
theoretical foundation for event specification
and its detection in centralized and distributed
environments. The main objective of this
project is to develop the theory, architecture, and
prototype implementation of a selective
propagation approach that can be applied to the web
and other large-scale network-centric
environments. We will draw upon the techniques
developed for Sentinel and re-examine them
from a broader, general-purpose context. Some
of the issues that will be investigated in this
project are:</p>
      <list list-type="bullet">
        <list-item><p>Develop an approach (both language
and constraints) to specify (primitive)
changes to a hierarchical (XML) document
at different levels of granularity. Develop a
GUI, if needed.</p></list-item>
        <list-item><p>Provide the ability to specify combinations of primitive
changes using a language such as Snoop,
which will allow one to specify higher levels
of abstraction of changes (such as
combinations of changes, sequences of
changes, aggregate changes, etc.).</p></list-item>
        <list-item><p>Develop techniques for selective
propagation between a web server and its
browsing clients.</p></list-item>
        <list-item><p>Extend the above to propagate selective
changes from one or more web servers to
another web server (the distributed case).</p></list-item>
        <list-item><p>Develop propagation techniques that take
into account QoS and other constraints.</p></list-item>
      </list>
      <p>Developing solutions to the above issues
will enable us to develop a general-purpose
solution to selective information propagation
for a large network-centric environment.</p>
      <p>The remainder of the paper is organized as
follows. In Section 2 we give an overview of
related work. In Section 3 we discuss the
push/pull paradigms and their relevance to the
change detection problem on structured
documents. In Section 4 we present the
architecture and discuss the functionality of its
components. Finally, we discuss future work and
draw some conclusions in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Many tools have been developed and are
currently available for tracking changes to web
pages. AIDE (AT&amp;T Internet Difference
Engine) developed by AT&amp;T [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] shows the
difference between two HTML pages. The
granularity of change detection is restricted to a
page in AIDE. It is not possible to view changes
at a finer level of granularity, such as links
within a page, keywords, images, tables, lists, or
phrases.
      </p>
      <p>Changedetection.com [2] allows users to
register their request and notifies them when
there is a change. We believe that polling (or
timestamp information) is used for detecting
changes to a page of interest. When a change is
detected, the user is notified. The notification
does not include what has changed in the page.
The user is not given a choice of specifying the
type of changes to be tracked on a particular
page. Again, the granularity is a page.</p>
      <p>
        Mind-it [3] and WebCQ [
        <xref ref-type="bibr" rid="ref2">4</xref>
        ] both support
customized change detection and notification.
Mind-it, formerly known as URL-Minder, is
commercially available. Both of these systems
track changes at a finer level of granularity in a
page. They do not support change specification
on multiple pages or combinations of changes
within a page (e.g., a phrase change and a link
change). They also do not use active capability
for either detecting changes or propagating
changes.
      </p>
      <p>
        In Xyleme [
        <xref ref-type="bibr" rid="ref3 ref4">5, 6</xref>
        ], the active paradigm is
used for detecting changes by evaluating
continuous/monitoring queries on
XML/HTML documents. The focus is on the
subscription language and continuous queries.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Push/Pull Paradigms</title>
      <p>The traditional approach to information
management has been through the use of a
Database Management System (DBMS).
Early DBMSs were developed to satisfy the
needs of certain classes of business applications
(mainly airline and banking industries). The
requirements of these industries were to store,
retrieve, and manipulate large amounts of data
concurrently, and in a consistent manner (plus
allow for failure recovery etc.). Data was stored
in databases and the user had to perform
operations explicitly to retrieve data from the
system. The burden of retrieving relevant
information was on the user. This is the
traditional “pull” paradigm where the user
retrieves information by performing an explicit
action in the form of a query, application, or
transaction execution.</p>
      <p>Even for traditional business applications, such
as inventory control, the pull approach poses
certain limitations. For example, in order to keep
track of an inventory item (to order additional
supplies when the number of widgets falls below
a threshold), one has to periodically check (by
executing a query) to find out how many
widgets are currently present. The traditional
DBMS is not capable of automatically informing
the user that widgets have fallen below a
prespecified threshold. Not surprisingly, this
approach is still heavily used in web navigation,
search, and retrieval.</p>
      <p>Figure 1 indicates a different approach to
information retrieval and management. In this
push paradigm, the user does not have to query
or retrieve information as it changes. The system
is responsible for accepting user needs (in the
form of situations to monitor, business rules,
constraints, profiles, continuous search queries, etc.)
and informs the user (or a set of users) when
something of interest happens. For the widget
example above, the user indicates the threshold
and the notification mechanism. The system
monitors the quantity on hand every time a
widget is sold or returned (only when a change
takes place; not periodically) and informs the
user in a timely manner. This paradigm relieves
the user from frequently querying the data
sources, and shifts the responsibility of situation
monitoring from the user to the system. Of
course, in order to accomplish this, the system
needs to have additional functionality that is not
part of traditional DBMSs. Although this mode
of operation is recognized as beneficial and
results in significantly fewer data transfers,
accomplishing this for various architectures
(such as distributed, federated, and
network-centric) requires enhancements to the underlying
system or the incorporation of agents or mediators that
can carry this out in a non-intrusive manner. In
other words, the system needs to have the
capability to selectively push information. This
is a paradigm shift from how traditional
information systems are architected and
implemented. It is also a paradigm shift from the
users’ viewpoint.</p>
      <p>[Figure 1. Diagram labels: Query; Answers; Self-monitoring Reactive System; Repository/Web Store; Business rules, Constraints, Invariants, Situations to monitor; Updates, Transactions, Applications.]</p>
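      <p>The widget example can be sketched as a single event-condition-action rule that is evaluated only when the quantity changes. This is a minimal illustration, not the mechanism of any particular DBMS; the class name <monospace>Inventory</monospace>, its fields, and the message format are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Inventory:
    """Hypothetical self-monitoring store: rules fire on updates, not on polls."""
    quantity: int
    threshold: int
    notify: Callable[[str], None]
    _below: bool = field(default=False, init=False)

    def update(self, delta: int) -> None:
        # Event: a widget is sold (delta < 0) or returned (delta > 0).
        self.quantity += delta
        # Condition: the quantity on hand has fallen below the threshold.
        below = self.quantity < self.threshold
        # Action: notify once, when the condition first becomes true.
        if below and not self._below:
            self.notify(f"quantity {self.quantity} is below threshold {self.threshold}")
        self._below = below


messages: List[str] = []
stock = Inventory(quantity=5, threshold=3, notify=messages.append)
for delta in (-1, -1, -1, -1, +2):
    stock.update(delta)
```
]]></preformat>
      <p>The user never queries the quantity; the only work done is at the moment of each sale or return.</p>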
      <sec id="sec-3-1">
        <title>Push-Based Architectures</title>
        <p>Push technology can be introduced into a
system in a number of ways. The approach
primarily depends on the characteristics of the
underlying system in terms of its openness. The
following options can be inferred based on the
underlying system characteristics:</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.1 Integrated</title>
        <p>
          In this approach the underlying system is
actually modified to incorporate the push
technology in the form of ECA
(event-condition-action) rules. This approach assumes
that the source code for the underlying software
is available and that the developers have sufficient
understanding of the system to make changes at
the kernel level. For example, the Sentinel
object-oriented active system [
          <xref ref-type="bibr" rid="ref5 ref6 ref7">7-9</xref>
          ] used this
approach on the OpenOODB system from Texas
Instruments [
          <xref ref-type="bibr" rid="ref8">10</xref>
          ]. The sentry mechanism of the
underlying system was extended to introduce
notifications inside the wrapper for each method
to detect primitive events. Once primitive events
were detected, more complex composite events
were detected and rules executed outside of the
underlying system.
        </p>
        <p>The primary advantage of the integrated
approach is that only a minimal amount of code
needs to be added and many kinds of
optimisation can be incorporated, which results in good performance.
The footprint for primitive event detection is
small. Some of the functionality needed for
selective push technology (such as deferred
action execution) can be easily incorporated
using the integrated approach.</p>
        <p>
          So far, a number of research prototypes of
active database systems have been developed,
such as HiPAC [
          <xref ref-type="bibr" rid="ref9">11</xref>
          ], Ariel[
          <xref ref-type="bibr" rid="ref10">12</xref>
          ], Sentinel [
          <xref ref-type="bibr" rid="ref11 ref5">7, 13</xref>
          ],
Starburst [
          <xref ref-type="bibr" rid="ref12">14</xref>
          ], Exact [
          <xref ref-type="bibr" rid="ref13">15</xref>
          ], Postgres [
          <xref ref-type="bibr" rid="ref14">16</xref>
          ],
PEARD [
          <xref ref-type="bibr" rid="ref15">17</xref>
          ], SAMOS [
          <xref ref-type="bibr" rid="ref16 ref17">18, 19</xref>
          ], etc. Most of
them were developed from scratch or integrated
directly into the kernel of the DBMS. The
integrated approach provides the following
advantages [
          <xref ref-type="bibr" rid="ref5">7</xref>
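          ]:
          <list list-type="bullet">
            <list-item><p>It does not require any changes to existing
applications.</p></list-item>
            <list-item><p>The DBMS is responsible for optimizing ECA
rules.</p></list-item>
            <list-item><p>DBMS functionality is extended.</p></list-item>
            <list-item><p>Applications are more modular
and easier to maintain.</p></list-item>
          </list>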
        </p>
        <p>However, implementing the integrated
approach requires access to the internals of the
DBMS into which the active capability is being
integrated. This requirement of access to source
code makes the cost of the integrated approach very
high and requires a long integration time as well.
Hence, most integrated systems are research
prototypes.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.2 Agent-Based/Mediated</title>
        <p>
          The assumption for this approach is that one
does not have access to the source code of the
underlying system. In fact, this is true in many
real-life scenarios where a
commercial off-the-shelf (COTS) system is being used (a relational
DBMS is an example). However, the underlying
system may provide some hooks using which
one can incorporate push capability effectively.
We have experimented with this approach in a
number of ways and have developed
mediators/agents [
          <xref ref-type="bibr" rid="ref18">20</xref>
          ] to add full active
capability to a relational DBMS. Intelligent
agents are introduced between the end user
(client) and the system (of course transparently
to the user) and the agent provides additional
capabilities that are not provided by the
underlying system.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.1.3 Wrapper-Based</title>
        <p>For this approach, the assumption is that the
underlying system is a legacy system and as a
result does not support appropriate hooks and
hence it is extremely difficult (and impossible in
most cases) to modify the underlying source
code. Typically, a wrapper (or a whopper) is
built which interfaces to the outside world and
push capabilities are added to this wrapper. The
wrapper in turn uses the API of the underlying
legacy system and may add some additional
functionality, not provided by the underlying
system (sorting, for example). This approach
needs a good understanding of the underlying
system and the wrapper has to be developed for
each legacy system separately. This approach is
not preferred unless it is the only alternative for
bringing the legacy system on par with other systems
so that it can join a federation or a
distributed environment.</p>
        <p>WebVigil is a change detection and
notification system, which can monitor and
detect changes to unstructured documents in
general. The current work addresses
HTML/XML documents that are part of a web
repository. WebVigil aims at investigating the
specification, management, and propagation of
changes as requested by the user in a timely
manner while meeting the quality of service
requirements. Figure 2 summarizes the high
level architecture of WebVigil. Users specify
their interest in the form of a sentinel that is used
for change detection and presentation.
Information from the sentinel is extracted and
stored in a data/knowledge base and is used by
the other modules in the system.</p>
        <p>[Figure 2. Diagram labels: User Specification; Presentation/Notification; Data/Knowledge Base; ECA Rule Generation; Change Detection; Caching and Management; Event-Based Fetching.]</p>
        <p>The functionality of each module in the
architecture is described briefly in the following
sections.</p>
      </sec>
      <sec id="sec-3-5">
        <title>User specification</title>
        <p>Users may wish to track changes to a given
page with respect to links, words, keywords,
phrases, images, table(s), list(s), or any change.
We define such a request from the user as a
sentinel. The user creates a sentinel to define the
changes of interest with respect to a page. A
partial syntax of the sentinel is shown in Figure
3. The system generates a unique identifier for
every sentinel. The sentinel-target specifies the
URL to be monitored for change detection.
The sentinel type can be a primitive change (links,
images, …) or a composite change (a combination
of primitive changes using operators such as
AND, OR and NOT). The lifespan of the
sentinel can be periodic (from a fixed point in
time to another fixed point in time) or aperiodic
(from and to the activation/termination of other
sentinels set by the same user). Once the sentinel
is initialised, it becomes active when the
condition associated with it becomes true.
The Notify clause of a sentinel specifies the frequency
with which the user wishes to be informed of
changes. The “notify options” give the user a
set of methods for change notification. The
sentinel is created with default settings unless stated
otherwise by the user. The default settings
are:</p>
        <list list-type="bullet">
          <list-item><p>FROM: the time at which the sentinel is initiated.</p></list-item>
          <list-item><p>NOTIFY: Immediate.</p></list-item>
          <list-item><p>BY: e-mail.</p></list-item>
        </list>
        <p>Immediate indicates that the user should be
notified as soon as the page changes. Of course,
there may be a small interval between the
occurrence of a change and its detection by
WebVigil. We plan to quantify this
difference more formally and validate it through
experiments. If an interval is specified, the user
is notified once per interval even if the page
changes several times during that interval.
Consider the following scenario: Jill wants to be
notified daily by e-mail of changes in links and
images on the page “http://www.gallery.com”
from Feb 2, 2002 to Mar 2, 2002. The
sentinel for the above scenario is as follows:</p>
        <preformat>Create Sentinel sen_1
    ON “http://www.gallery.com”
    MONITOR links AND images
    FROM Feb 2, 2002
    TO Mar 2, 2002
    NOTIFY every day
    BY email jill@aol.com</preformat>
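        <p>One way to hold a parsed sentinel in memory is sketched below. The field names, the representation of MONITOR as a tuple of change types, and the <monospace>is_active</monospace> helper are assumptions made for illustration; only the clause names and the three defaults (FROM, NOTIFY, BY) come from the description above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class Sentinel:
    """Hypothetical in-memory form of a sentinel; field names are illustrative."""
    name: str
    target: str                       # ON: URL to monitor
    monitor: Tuple[str, ...]          # MONITOR: change types (combined with AND here)
    start: Optional[datetime] = None  # FROM: defaults to the creation time
    end: Optional[datetime] = None    # TO: no default is stated in the text
    notify: str = "immediate"         # NOTIFY default
    by: str = "email"                 # BY default

    def __post_init__(self) -> None:
        if self.start is None:
            self.start = datetime.now()

    def is_active(self, at: datetime) -> bool:
        """True while `at` falls inside the sentinel's lifespan."""
        return self.start <= at and (self.end is None or at <= self.end)


# The sen_1 example: Jill's daily notification for links and images.
sen_1 = Sentinel(
    name="sen_1",
    target="http://www.gallery.com",
    monitor=("links", "images"),
    start=datetime(2002, 2, 2),
    end=datetime(2002, 3, 2),
    notify="every day",
)
```
]]></preformat>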
      </sec>
      <sec id="sec-3-6">
        <title>Data/Knowledge Base (D/KB)</title>
        <p>The Data/Knowledge Base (D/KB) is a persistent repository
containing meta-data about each user, the number
and names of the sentinels set by each user, and
details of the contents of each sentinel (frequency
of notification, change type, etc.). User input is
parsed, and the required information is extracted and
stored for later use. For example, for each URL, it
stores the following parameters: last modified
date, last check time, checksum, and frequency
of checks. The D/KB may also store the notification
method and notification frequency for each
&lt;user-URL&gt; pair. The D/KB also acts as a
persistent store so that all the memory-resident
information can be regenerated in case of a
system crash. The rest of the modules of
WebVigil use the D/KB for information needed
at run time. By comparison, AIDE maintains a relational
database containing information about each
page, each user, and the relationship between them.</p>
        <p>
          We plan to use ECA rules and the event
detection approach in two places: i) rules for
retrieving pages in an intelligent manner based
on the user specification (e.g., the user-specified frequency
coupled with whether the page has changed in
that interval) and ii) rules for propagating pages to
detect higher-level changes. ECA rules will help
us to propagate changes requested by the user in
a timely manner. In WebVigil, the ECA rule
generation module uses the concepts defined in
[
          <xref ref-type="bibr" rid="ref6 ref7">8, 9</xref>
          ] to provide the required active capability.
This module constructs and maintains “change
detection graphs” which keep track of
relationships between the pages and sentinels.
Each node specifies the change requested in the
sentinel on that page. In a change detection
graph, the leaf node represents the page of
interest and non-leaf nodes represent operators
for various types of changes (e.g., phrase change
is an operator).
        </p>
        <p>[Figure 4. Diagram labels: S1, S2, S3, P1, P2.]</p>
        <p>Figure 4 shows a change detection graph for a
page P1 where nodes S1 and S2 represent the
type of change detection requested by sentinels
present on P1. For every leaf node Pi, a periodic
or aperiodic rule Ri is generated, with the event
part of the rule specifying the frequency and the
action part containing calls to the fetch procedure
followed by a notification to the change
detection graph, if necessary.</p>
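        <p>A change detection graph of this shape can be sketched as a map from each page (leaf) to the operator nodes that sentinels have placed on it. The class and function names below are hypothetical, and the two operators are deliberately simplistic stand-ins for the change types supported by the system.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A change operator compares two versions of a page and reports whether
# the change type it represents occurred (hypothetical signature).
Operator = Callable[[str, str], bool]


class ChangeDetectionGraph:
    def __init__(self) -> None:
        # page (leaf node) -> list of (sentinel name, operator) parent nodes
        self._parents: Dict[str, List[Tuple[str, Operator]]] = defaultdict(list)

    def add_sentinel(self, name: str, page: str, op: Operator) -> None:
        self._parents[page].append((name, op))

    def page_fetched(self, page: str, old: str, new: str) -> List[str]:
        """Propagate a newly fetched version from the leaf to its operators."""
        return [name for name, op in self._parents[page] if op(old, new)]


def any_change(old: str, new: str) -> bool:
    return old != new


def keyword_change(keyword: str) -> Operator:
    # Fires when the keyword appears or disappears.
    return lambda old, new: (keyword in old) != (keyword in new)


graph = ChangeDetectionGraph()
graph.add_sentinel("S1", "P1", any_change)
graph.add_sentinel("S2", "P1", keyword_change("sale"))
fired = graph.page_fetched("P1", "spring show", "spring sale")
```
]]></preformat>
        <p>In this sketch the action part of a rule Ri would call the fetch procedure and then <monospace>page_fetched</monospace>, so that one fetch of P1 serves every sentinel attached to it.</p>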
        <p>Detection algorithms have been developed to
detect changes between two versions of a page
with respect to a change type. For a change to be
detected, the object of interest is extracted from
the given versions of the page depending upon
the change type. Figure 5 shows the change
types that are identified and supported in the
current prototype of WebVigil. Changes to links,
images, words, and keyword(s) are captured in
terms of insertions or deletions.</p>
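        <p>For links, extraction followed by comparison might look like the sketch below, which uses Python's standard <monospace>html.parser</monospace> module and set differences. It is an illustration of the insert/delete idea only, not the prototype's algorithm; the helper names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from html.parser import HTMLParser
from typing import Dict, Set


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html: str) -> Set[str]:
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def link_changes(old_html: str, new_html: str) -> Dict[str, Set[str]]:
    """Report links present only in the new version or only in the old one."""
    old, new = extract_links(old_html), extract_links(new_html)
    return {"inserted": new - old, "deleted": old - new}


old_page = '<a href="/a.html">A</a> <a href="/b.html">B</a>'
new_page = '<a href="/b.html">B</a> <a href="/c.html">C</a>'
changes = link_changes(old_page, new_page)
```
]]></preformat>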
        <p>
          Object identification, extraction, and change
detection are more complicated for phrases. To
identify an object (phrase) in a given page, we
use the words surrounding it as its signature. We
assume that these words are relatively stable.
WebCQ [
          <xref ref-type="bibr" rid="ref2">4</xref>
          ] uses the concept of a bounding box
to tackle this problem. A change to a table or list is
specified in terms of an update made to its
contents. An insertion of a new table or list is
not captured under this change type. For a phrase
change, an insert or delete indicates the appearance
or disappearance of the complete phrase in the
page. Currently, the change detection algorithms
are being reviewed for better performance and
for scale-up. Abiteboul et al. [
          <xref ref-type="bibr" rid="ref19">21</xref>
          ] detect changes
at the page level and insertions at the node level,
which is somewhat different from our focus.
        </p>
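        <p>The signature idea can be illustrated with a small sketch in which a phrase is located by the words immediately before and after it. The window size <monospace>k</monospace>, whitespace tokenization, and exact matching of the signature words are all assumptions; the text above states only that surrounding words serve as the signature and are assumed to be relatively stable.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Optional, Tuple

Signature = Tuple[Tuple[str, ...], Tuple[str, ...]]


def find(words: List[str], seq: Tuple[str, ...], start: int = 0) -> int:
    """Index of the first occurrence of `seq` at or after `start`, or -1."""
    n = len(seq)
    for i in range(start, len(words) - n + 1):
        if tuple(words[i:i + n]) == seq:
            return i
    return -1


def signature(text: str, phrase: str, k: int = 2) -> Optional[Signature]:
    """The k words before and after the phrase, or None if it is absent."""
    words, target = text.split(), tuple(phrase.split())
    i = find(words, target)
    if i < 0:
        return None
    end = i + len(target)
    return tuple(words[max(0, i - k):i]), tuple(words[end:end + k])


def phrase_at(text: str, sig: Signature) -> Optional[str]:
    """Text currently enclosed by the signature words, or None if they moved."""
    words = text.split()
    before, after = sig
    i = find(words, before)
    if i < 0:
        return None
    j = find(words, after, i + len(before))
    if j < 0:
        return None
    return " ".join(words[i + len(before):j])


old_text = "Tickets for the spring show go on sale Friday"
new_text = "Tickets for the summer gala go on sale Friday"
sig = signature(old_text, "spring show")
```
]]></preformat>
        <p>Comparing <monospace>phrase_at(old_text, sig)</monospace> with <monospace>phrase_at(new_text, sig)</monospace> then distinguishes an unchanged phrase from one that was replaced or removed.</p>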
      </sec>
      <sec id="sec-3-7">
        <title>Caching and Management of pages</title>
        <p>An important feature of the WebVigil architecture
is its centralized, server-based repository service
that archives and manages versions of pages.
WebVigil retrieves and stores only those pages
needed by a sentinel. The primary purpose of the
repository service is to reduce the number of
network connections to the remote web server,
thereby reducing network traffic. When a
remote page fetch is initiated, the repository
service checks for the existence of the remote
page in its cache and, if present, returns the latest version
of the page in the cache. In case of a
cache miss, the repository service requests that
the page be fetched from the appropriate remote
server. Subsequent requests for the web page
can access the page from the cache instead of
repeatedly invoking a fetch procedure.</p>
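        <p>The hit/miss behaviour described above can be sketched as follows. The class <monospace>PageRepository</monospace> and the <monospace>fetch</monospace> callable are hypothetical stand-ins for the repository service and the remote fetch procedure; versions are kept as whole copies here, which ignores the storage-overhead concern.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, List


class PageRepository:
    """Hypothetical version cache; `fetch` stands in for the remote fetch."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._versions: Dict[str, List[str]] = {}

    def get(self, url: str) -> str:
        versions = self._versions.get(url)
        if versions:                  # cache hit: return the latest version
            return versions[-1]
        return self.refresh(url)      # cache miss: go to the remote server

    def refresh(self, url: str) -> str:
        """Fetch from the remote server and archive the result as a new version."""
        page = self._fetch(url)
        self._versions.setdefault(url, []).append(page)
        return page


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Records each remote request so the example can count them.
    calls.append(url)
    return f"content of {url}"


repo = PageRepository(fake_fetch)
first = repo.get("http://www.gallery.com")
second = repo.get("http://www.gallery.com")
```
]]></preformat>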
        <p>
          The repository service reduces network traffic
and latency for obtaining a web page because
WebVigil can obtain the “Target Web Pages”
from the cache instead of having to request each
page directly from the remote server. The
quality of service for the repository service
includes managing multiple versions of pages
without excessive storage overhead.
WebGUIDE [
          <xref ref-type="bibr" rid="ref20">22</xref>
          ] manages versions of pages by
storing pages in RCS [
          <xref ref-type="bibr" rid="ref21">23</xref>
          ] format.
        </p>
      </sec>
      <sec id="sec-3-8">
        <title>Page Retrieval</title>
        <p>WebVigil uses a wrapper for the task of
retrieving the pages registered with it. The
wrapper is responsible for informing WebVigil
about changes in the properties of the pages. By
properties, we mean the size of the page and its
last-modified time stamp. When the time stamp of
a page changes together with an increase or
decrease in page size, the wrapper notifies
WebVigil of the change, and WebVigil then fetches and
caches the page. When the time stamp is
modified but the page size remains the same,
the wrapper still reports this as a change. WebVigil
then fetches the page and calculates its checksum.
The page is cached only if the calculated
checksum differs from the checksum of the
cached copy of this page.</p>
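        <p>The decision logic in this paragraph can be written out as a small function. The names are hypothetical, and the checksum algorithm is an assumption (SHA-256 is used here because the text does not name one).</p>
        <preformat preformat-type="code"><![CDATA[
```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PageProperties:
    last_modified: float   # time stamp reported by the web server
    size: int              # page size in bytes


def checksum(content: bytes) -> str:
    # The text does not name a checksum algorithm; SHA-256 is an assumption.
    return hashlib.sha256(content).hexdigest()


def page_to_cache(old: PageProperties, new: PageProperties,
                  cached_checksum: str,
                  fetch: Callable[[], bytes]) -> Optional[bytes]:
    """Return the fetched page if it should be cached, else None."""
    if new.last_modified == old.last_modified:
        return None                # no change reported: nothing to fetch
    content = fetch()
    if new.size != old.size:
        return content             # time stamp and size changed: cache
    # Time stamp changed but size did not: compare checksums before caching.
    return content if checksum(content) != cached_checksum else None


old_props = PageProperties(last_modified=100.0, size=5)
cached = checksum(b"hello")
```
]]></preformat>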
        <p>For dynamically generated pages, WebVigil
directly fetches the page without using the
wrapper, as page properties are not available. It
then checks for change by calculating the
checksum of the page. The wrapper may,
depending on the paradigm (Push/Pull) be either
located at the web server or be a part of
WebVigil. Irrespective of its location the
primary function of the wrapper is to retrieve
metadata and inform WebVigil of the change in
page properties. WebVigil in turn fetches and
caches pages of interest.</p>
        <p>[Figure 6. Diagram labels: Web Server; Wrapper; WebVigil; 1. Retrieval of page properties; 2. Page properties; 3. Retrieval of page.]</p>
        <p>In the pull approach, the wrapper is located at
WebVigil. It polls and pulls the properties of the
pages from the remote web server. Figure 6
illustrates this approach.</p>
        <p>In the push approach, the wrapper is located at
the remote web server. The wrapper is assumed
to know all the pages that are registered with
WebVigil and belong to the web server on
which it resides. It informs WebVigil of (pushes) the change
information. Figure 7 illustrates
this approach. The placement of the wrapper is
a trade-off between communication, processing,
and storage. At first glance it may seem obvious
that a local wrapper should be used,
but the cost of polling and the network cost may be
crucial, in which case a remote wrapper will be
preferable. We intend to develop both local and
remote wrappers, evaluate their performance,
and use them appropriately.</p>
        <p>[Figure 7. Diagram labels: Web Server; Wrapper; WebVigil; 1. Send page properties; 2. Retrieval of page.]</p>
        <p>The presentation method selected should
clearly state the detected differences between
two web pages to the user. Therefore, computing
and displaying the detected differences is very
important. In this section, issues related to
displaying the detected changes and notifying
users of them are discussed.</p>
      </sec>
      <sec id="sec-3-9">
        <title>Presentation</title>
        <p>
          The existing tools display changes in
different ways: 1) merging the two
documents, 2) displaying only the changes, and 3)
highlighting the differences in both pages
[
          <xref ref-type="bibr" rid="ref1 ref2">1, 4</xref>
          ]. Summarizing the common and changed
data into a single merged document has the
advantage of displaying the common portions
only once [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. HTMLdiff [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Unixdiff [
          <xref ref-type="bibr" rid="ref22">24</xref>
          ]
use this style to display detected changes. The
disadvantage of this approach is that it is
difficult for the user to view the changes when
they are large in number.
        </p>
        <p>Displaying only the computed differences is a
better option when the user is interested in
tracking changes to multiple pages or when the
number of changes is large. However, highlighting the
differences by displaying both pages
side-by-side is preferable for change types such as “any
change” and “phrase change”. In this case, the
detected differences can be perceived better if
the change in the new page is shown relative to
the old page.</p>
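        <p>As a rough illustration of the three presentation styles, the following Python sketch applies the standard difflib module to two small, made-up page versions. The sample lines and the function names merged_view, changes_only, and side_by_side are assumptions introduced here for illustration; they are not part of WebVigil or of the tools cited above.</p>
        <preformat>
```python
import difflib

# Two hypothetical versions of a page, one list entry per line of text.
old_page = ["WebVigil tracks changes.", "Release 1.0 is available.", "Contact us."]
new_page = ["WebVigil tracks changes.", "Release 1.1 is available.", "Contact us."]


def merged_view(old, new):
    """Style 1: one merged document; common lines appear only once."""
    # ndiff prefixes lines with "  " (common), "- " (old only), "+ " (new only).
    # Lines starting with "? " are character-level hints and are dropped here.
    return [line for line in difflib.ndiff(old, new)
            if not line.startswith("? ")]


def changes_only(old, new):
    """Style 2: only the lines that were deleted or inserted."""
    return [line for line in difflib.ndiff(old, new)
            if line.startswith("- ") or line.startswith("+ ")]


def side_by_side(old, new):
    """Style 3: both pages next to each other with differences highlighted."""
    return difflib.HtmlDiff().make_table(old, new, "old page", "new page")


if __name__ == "__main__":
    print("\n".join(merged_view(old_page, new_page)))
    print("\n".join(changes_only(old_page, new_page)))
```
        </preformat>
        <p>With many changes, changes_only stays short while merged_view grows with the page, which mirrors the trade-off described above.</p>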
        <p>Because WebVigil will track multiple types of
changes on a web page and will eventually notify users
through different media (email, PDA, laptop, etc.),
a combination of all the presentation styles discussed
above will be relevant: the information to be
sent will vary with factors such as the
notification method, the number of detected
differences, and the type of changes.</p>
      </sec>
      <sec id="sec-3-10">
        <title>4.7.2 Notification</title>
        <p>What, When and How to notify are three
important issues for proper notification. These
issues are discussed below:</p>
      </sec>
      <sec id="sec-3-11">
        <title>4.7.2.1 Presentation Content</title>
        <p>Presentation content should be concise and
lucid. Users should be able to clearly perceive
the computed differences in the context of
their predefined specification. The notification
report could contain the following basic
information:
• The change detected in the latest page
relative to the reference page.
• The user-specified type of change, such as “any
change” or “all words”.
• The URL for which the change detection module
is invoked.
• A short summary explaining the detected
change. This could include the status of each
change (Insert, Delete, or Changed)
for certain user-defined types of
changes such as “images”, “all links”, and
“keywords”, and/or the timestamps
indicating the modification, polling, change
detection, and notification dates.</p>
        <p>The size of the notification report will depend
on the maximum amount of information that can be sent
to a user while satisfying the network quality of
service requirements.</p>
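        <p>The report contents listed above could be modelled along the following lines. This is only a sketch: the class name NotificationReport, its field names, and the max_chars size limit (standing in for whatever bound the quality of service requirements impose) are hypothetical and are not taken from the WebVigil design.</p>
        <preformat>
```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class NotificationReport:
    """Hypothetical container for the basic report information."""
    url: str               # URL for which change detection was invoked
    change_type: str       # user-specified type, e.g. "any change"
    changes: List[str]     # detected changes, each tagged Insert/Delete/Changed
    polled_at: datetime    # when the page was polled
    detected_at: datetime  # when the change was detected

    def render(self, max_chars: int) -> str:
        """Build the report text, truncated to an assumed size limit."""
        lines = [
            f"URL: {self.url}",
            f"Change type: {self.change_type}",
            f"Summary: {len(self.changes)} change(s) detected",
            f"Polled: {self.polled_at.isoformat()}",
            f"Detected: {self.detected_at.isoformat()}",
        ]
        lines += self.changes
        text = "\n".join(lines)
        if len(text) > max_chars:
            # Keep the header and cut the tail so the limit is respected.
            text = text[: max_chars - 3] + "..."
        return text


if __name__ == "__main__":
    report = NotificationReport(
        url="http://example.org/",
        change_type="all links",
        changes=["Insert: http://example.org/new", "Delete: http://example.org/old"],
        polled_at=datetime(2002, 1, 1, 12, 0),
        detected_at=datetime(2002, 1, 1, 12, 1),
    )
    print(report.render(max_chars=500))
```
        </preformat>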
      </sec>
      <sec id="sec-3-12">
        <title>4.7.2.2 Notification frequency</title>
        <p>The user can be notified of a detected change in two
ways:
• Notify immediately when the change is
detected.
• Notify after a fixed time interval.</p>
        <p>The user may want to be notified immediately
of changes on particular pages. In such cases,
immediate notification should be sent to the
user. On the other hand, changes will be detected
very frequently on web pages that
are modified frequently. Since notifying the user
of each of these changes would
be a bottleneck on the network, it is preferable
to send notifications periodically. The user
can therefore specify the notification interval in the
sentinel.</p>
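        <p>The two notification policies can be sketched as follows. The Notifier class, its method names, and the use of plain numbers for time are assumptions made for this example; a real system would presumably drive tick from a timer and read the interval from the sentinel.</p>
        <preformat>
```python
class Notifier:
    """Hypothetical notifier: immediate, or batched at a fixed interval."""

    def __init__(self, interval_seconds=None):
        # interval_seconds=None is taken to mean "notify immediately".
        self.interval = interval_seconds
        self.pending = []     # changes detected but not yet sent
        self.last_sent = 0.0  # time of the last notification
        self.sent = []        # delivered batches, kept for inspection

    def on_change(self, change, now):
        """Record a detected change; send at once in immediate mode."""
        self.pending.append(change)
        if self.interval is None:
            self.flush(now)

    def tick(self, now):
        """Called periodically; sends a batch once the interval has elapsed."""
        if self.interval is not None and now - self.last_sent >= self.interval:
            self.flush(now)

    def flush(self, now):
        """Deliver all pending changes as one notification."""
        if self.pending:
            self.sent.append(list(self.pending))
            self.pending.clear()
            self.last_sent = now


if __name__ == "__main__":
    periodic = Notifier(interval_seconds=60)
    periodic.on_change("headline changed", now=10)
    periodic.on_change("link added", now=20)
    periodic.tick(now=60)
    print(periodic.sent)
```
        </preformat>
        <p>In periodic mode the two changes travel in a single notification, which is the network saving the paragraph above argues for.</p>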
      </sec>
      <sec id="sec-3-13">
        <title>4.7.2.3 Notification methods</title>
        <p>Different notification options such as email, fax, PDA,
and web page can be used for notification.
Notification can be initiated either by the server
or by the client. In WebVigil, server-based push
initiation is considered. The server, based on the
notification frequency, can push the information
to the user, thus propagating the changes “just
in time” (JIT).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Conclusions and Future Work</title>
      <sec id="sec-4-1">
        <title>5.1 Conclusions</title>
        <p>
          The basic architecture of WebVigil has been
designed to track and propagate changes on
unstructured documents as requested by the user
in a timely manner, meeting the quality of
service requirements. The design accommodates
specification of multiple types of changes, on
multiple web pages (composite events). The
existing event specification language “SNOOP”
[
          <xref ref-type="bibr" rid="ref6 ref7">8, 9</xref>
          ] will be used for specifying composite
events.
        </p>
        <p>The design and implementation of the system
will address the issues regarding scalability and
user flexibility. Implementation of WebVigil
will augment the current strategy of pulling
information periodically and checking for
interesting changes.</p>
        <p>The existing
method detects changes between the current and
the last changed page. This method can be
improved upon by giving the user the choice to
select the reference page. The user can specify a
fixed reference page and
must have the
flexibility to change the reference. The moving
window concept for tracking changes in
WebVigil can be improved by allowing a page
to be used as the reference for detecting changes in
the next n pages, where the user defines n. After
changes are detected in n pages, the nth page
becomes the reference page. Consider the
following scenario:</p>
        <p>Jill wants to use the first version of the page as
the reference. She wants to track changes for the
next five revisions to the page against this
reference. After five changes, the reference page
should be the fifth page and the next five
changes should be tracked relative to this page.</p>
        <p>An added feature will be to notify the user of
cumulative changes. The user can be given the
option of being notified of the cumulative n changes,
where n is specified in the sentinel.
An additional feature, such as a personalized
change summary page for each user, can also be provided. The user
can look up this page to get the history of his or her
installed sentinels and the changes tracked to
date.</p>
        <p>2. Changedetection,
http://www.changedetection.com.</p>
        <p>3. Mind-it, http://www.netmind.com/.</p>
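        <p>The proposed n-page reference window can be sketched as below. The class name ReferenceWindow and its observe method are hypothetical. The sketch also assumes one reading of the scenario: only changed versions are passed to observe, and the nth changed version becomes the new reference.</p>
        <preformat>
```python
class ReferenceWindow:
    """Hypothetical tracker: one reference page serves the next n changes."""

    def __init__(self, first_version, n):
        self.reference = first_version  # page that changes are compared against
        self.n = n                      # user-defined window size
        self.count = 0                  # changed versions seen in this window

    def observe(self, changed_version):
        """Return the reference to compare against, then advance the window."""
        reference_used = self.reference
        self.count += 1
        if self.count == self.n:
            # Assumption: the nth changed version becomes the next reference.
            self.reference = changed_version
            self.count = 0
        return reference_used


if __name__ == "__main__":
    window = ReferenceWindow("v0", n=5)
    for i in range(1, 11):
        version = f"v{i}"
        print(version, "is compared against", window.observe(version))
```
        </preformat>
        <p>With n = 5, the first five revisions are compared against the first version and the next five against the fifth revision, as in the scenario above.</p>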
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Douglis</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.,
          <article-title>The AT&amp;T Internet Difference Engine: Tracking and Viewing Changes on the Web</article-title>
          . in World Wide Web.
          <year>1998</year>
          , Baltzer Science Publishers. p.
          <fpage>27</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          4.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>WebCQ: Detecting and Delivering Information Changes on the Web</article-title>
          . in
          <source>Proceedings of International Conference on Information and Knowledge Management (CIKM)</source>
          .
          <year>2000</year>
          . Washington D.C: ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>5. Xyleme, http://www.xyleme.com/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.
          <article-title>Monitoring XML Data on the Web</article-title>
          .
          <source>in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data</source>
          .
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chakravarthy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.,
          <article-title>Design of Sentinel: An Object-Oriented DBMS with Event-Based Rules</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <year>1994</year>
          .
          <volume>36</volume>
          (
          <issue>9</issue>
          ): p.
          <fpage>559</fpage>
          --
          <lpage>568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chakravarthy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.,
          <article-title>Composite Events for Active Databases: Semantics, Contexts and Detection</article-title>
          ,
          <source>in Proc. Int'l. Conf. on Very Large Data Bases VLDB</source>
          .
          <year>1994</year>
          : Santiago, Chile. p.
          <fpage>606</fpage>
          --
          <lpage>617</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chakravarthy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <article-title>Snoop: An Expressive Event Specification Language for Active Databases</article-title>
          .
          <source>Data and Knowledge Engineering</source>
          ,
          <year>1994</year>
          .
          <volume>14</volume>
          (
          <issue>10</issue>
          ): p.
          <fpage>1</fpage>
          --
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wells</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Blakeley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.W.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>Architecture of an Open Object-Oriented Database Management System</article-title>
          .
          <source>IEEE Computer</source>
          ,
          <year>1992</year>
          .
          <volume>25</volume>
          (
          <issue>10</issue>
          ): p.
          <fpage>74</fpage>
          --
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          11.
          <string-name>
            <surname>Chakravarthy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.,
          <source>HiPAC: A Research Project in Active, Time-Constrained Database Management (Final Report)</source>
          .
          <year>1989</year>
          , Xerox Advanced Information Technology.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hanson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <article-title>The Ariel Project</article-title>
          , in
          <source>Active Database Systems - Triggers and Rules For Advanced Database Processing</source>
          .
          <year>1996</year>
          , Morgan Kaufmann Publishers Inc. p.
          <fpage>63</fpage>
          --
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          13.
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Maugis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakravarthy</surname>
          </string-name>
          ,
          <article-title>A New Perspective on Rule Support for Object-Oriented Databases</article-title>
          , in
          <source>1993 ACM SIGMOD Conf. on Management of Data</source>
          .
          <year>1993</year>
          : Washington D.C. p.
          <fpage>99</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          14.
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>The Starburst Rule System</article-title>
          , in
          <source>Active Database Systems - Triggers and Rules For Advanced Database Processing</source>
          .
          <year>1996</year>
          , Morgan Kaufmann Publishers Inc. p.
          <fpage>87</fpage>
          --
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          15.
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <article-title>Rule Management in Object-Oriented Databases: A Unified Approach</article-title>
          ,
          <source>in Proceedings 17th International Conference on Very Large Data Bases</source>
          .
          <year>1991</year>
          :
          Barcelona (Catalonia, Spain)
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          16.
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kemnitz</surname>
          </string-name>
          ,
          <article-title>The Postgres Next-Generation Database Management System</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>1991</year>
          .
          <volume>34</volume>
          (
          <issue>10</issue>
          ): p.
          <fpage>78</fpage>
          --
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          17.
          <string-name>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.D.</given-names>
            <surname>Urban</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.W.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <article-title>PEARD: A Prototype Environment for Active Rule Debugging</article-title>
          .
          <source>Intelligent Information Systems : Integrating Artificial Intelligence and Database Technologies</source>
          ,
          <year>1996</year>
          .
          <volume>7</volume>
          (
          <issue>2</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          18.
          <string-name>
            <surname>Gatziu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>K.R.</given-names>
            <surname>Dittrich</surname>
          </string-name>
          ,
          <article-title>Events in an Active Object-Oriented System</article-title>
          , in
          <source>Rules in Database Systems</source>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paton</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Williams</surname>
          </string-name>
          , Editors.
          <year>1993</year>
          , Springer. p.
          <fpage>127</fpage>
          --
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          19.
          <string-name>
            <surname>Gatziu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>K.R.</given-names>
            <surname>Dittrich</surname>
          </string-name>
          ,
          <article-title>SAMOS: an Active, Object-Oriented Database System</article-title>
          ,
          <source>in IEEE Quarterly Bulletin on Data Engineering</source>
          .
          <year>1992</year>
          . p.
          <fpage>23</fpage>
          --
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          20.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakravarthy</surname>
          </string-name>
          .
          <article-title>An Agent-Based Approach to Extending the Native Active Capability of Relational Database Systems</article-title>
          . in ICDE.
          <year>1999</year>
          . Australia: IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          21.
          <string-name>
            <surname>Cobena</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Marian</surname>
          </string-name>
          ,
          <article-title>Detecting Changes in XML Documents</article-title>
          .
          <source>Data Engineering</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          22.
          <string-name>
            <surname>Douglis</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.,
          <article-title>WebGUIDE: Querying and Navigating Changes in Web Repositories</article-title>
          . in Fifth International World Wide Web Conference.
          <year>1996</year>
          . Paris, France.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          23.
          <string-name>
            <surname>Tichy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <article-title>RCS: a system for version control</article-title>
          ,
          <source>in Software - Practice &amp; Experience</source>
          .
          <year>1985</year>
          . p.
          <fpage>637</fpage>
          -
          <lpage>654</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          24.
          <string-name>
            <given-names>J.W.</given-names>
            <surname>Hunt</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.D.</given-names>
            <surname>McIlroy</surname>
          </string-name>
          ,
          <article-title>An algorithm for efficient file comparison</article-title>
          .
          <year>1995</year>
          , Bell Laboratories: Murray Hill, N.J.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>